## Fine Tuning T5 model with Azure ML using Azure Container for PyTorch 

This tutorial shows how to fine tune the T5 model to generate a summary of a news article. We then deploy it to an online endpoint for real time inference. The model is trained on a tiny sample of the dataset with a small number of epochs to illustrate the fine tuning approach.

### Learning Objectives
- Fine tune the T5 small model for the `Summarization` task with `Azure ML` 
- Leverage the `ACPT` environment with state-of-the-art accelerators
- Increase training efficiency using [`DeepSpeed`](https://github.com/microsoft/DeepSpeed) and [`ONNX Runtime`](https://github.com/microsoft/onnxruntime)
- Evaluate the model using a prebuilt component
- Register the model with AzureML
- Deploy and run inference using MIR and ONNX Runtime


###### T5-small is a 60 million parameter model based on the text-to-text framework and pretrained on the Colossal Clean Crawled Corpus (C4) dataset. It is used for several NLP tasks, including machine translation, document summarization, and question answering.

Example tasks shown in the figure below: translation (green), linguistic acceptability (red), sentence similarity (yellow), and document summarization (blue).

##### In this workshop, we will fine tune the model for the document summarization task.


![Image](assets/t5modelcard.PNG)
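
In the text-to-text framework, every task is posed as text in and text out, with a short prefix telling the model which task to perform. The test payloads later in this tutorial prepend `summarize: ` to the article. A minimal sketch of that convention follows; the helper name `to_t5_input` is hypothetical, and the prefixes other than `summarize: ` are the ones commonly used with the public T5 checkpoints rather than anything defined in this tutorial.

```python
# Task prefixes commonly used with the public T5 checkpoints (an assumption for
# every entry except "summarize: ", which appears in this tutorial's payloads).
TASK_PREFIXES = {
    "summarization": "summarize: ",
    "translation_en_de": "translate English to German: ",
    "acceptability": "cola sentence: ",
}


def to_t5_input(task: str, text: str) -> str:
    """Prepend the task prefix so the model knows which task to perform."""
    return TASK_PREFIXES[task] + text.strip()


print(to_t5_input("summarization", "  The city council approved the new budget on Monday. "))
```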

#### 1. Prerequisites to install Azure ML Python SDK Version 2 
Please restart the kernel after the pip installs finish so that the environment picks up the new modules.

In [4]:
%pip install azure-ai-ml azure-identity datasets azure-cli mlflow
!az login --use-device-code
# %pip install onnxruntime transformers torch

#### 2. Connect to Azure Machine Learning workspace

Before we dive into the code, you'll need to connect to your workspace. The workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning.

For this lab, we've already set up an AzureML Workspace for you. If you'd like to learn more about `Workspace`s, please reference [`AzureML's documentation`](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace?view=azureml-api-2&tabs=azure-portal).

We use `AzureCliCredential` to get access to the workspace, which reuses the `az login` session from step 1, and fall back to `InteractiveBrowserCredential` if no CLI token is available. To learn more about other available credentials, go to [`Set up authentication`](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-setup-authentication?tabs=sdk&view=azureml-api-2).

In [5]:
from azure.ai.ml import MLClient
from azure.identity import AzureCliCredential, InteractiveBrowserCredential

# Prefer the Azure CLI login from step 1; fall back to an interactive browser login.
try:
    credential = AzureCliCredential()
    credential.get_token("https://management.azure.com/.default")
except Exception:
    credential = InteractiveBrowserCredential()

# Use a local config.json if one is present; otherwise pass the workspace details explicitly.
try:
    ml_client = MLClient.from_config(credential=credential)
except Exception:
    ml_client = MLClient(
        credential,
        subscription_id="ed2cab61-14cc-4fb3-ac23-d72609214cfd",
        resource_group_name="AMLDataCache",
        workspace_name="datacachetest",
    )



#### 3. Create a compute

Azure Machine Learning needs a compute resource to run a job. This resource can be a single-node or multi-node cluster running Linux or Windows. The following script creates an Azure Machine Learning compute with the `Standard_ND40rs_v2` SKU, which is InfiniBand-enabled (with Mellanox drivers) to provide high-bandwidth, low-latency communication between nodes. You can find the list of RDMA-capable sizes and more detail [here](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-hpc#rdma-capable-instances).

In [None]:
from azure.ai.ml.entities import AmlCompute

experiment_name = "T5-Summarization-news-summary"

# If you already have a GPU cluster, name it here. Otherwise a new one is created.
compute_cluster = "AMLBuild23Compute"
try:
    compute = ml_client.compute.get(compute_cluster)
    print("successfully fetched compute:", compute.name)
except Exception as ex:
    print("failed to fetch compute:", compute_cluster)
    print("creating new Standard_ND40rs_v2 compute")
    compute = AmlCompute(
        name=compute_cluster,
        size="Standard_ND40rs_v2", # Info on Standard_ND40rs_v2 SKU: https://learn.microsoft.com/en-us/azure/virtual-machines/ndv2-series
        min_instances=1,
        max_instances=2,  # For multi node training set this to an integer value more than 1
    )
    ml_client.compute.begin_create_or_update(compute).wait()
    print("successfully created compute:", compute.name)


#### 4. Create a job environment using Azure Container for PyTorch

We will create a custom environment from the existing ACPT curated environment, which bundles state-of-the-art technologies such as DeepSpeed and ONNX Runtime. You can find more detail in [Custom Environment](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-azure-container-for-pytorch-environment?view=azureml-api-2).


View the [Environments in Azure Machine Learning studio](https://ml.azure.com/environments).

In [6]:
from azure.ai.ml.entities import Environment, BuildContext

Env_Name = "MSBuildLab110_env"
env_docker_context = Environment(
    build=BuildContext(path="src/Environment/context"),
    name=Env_Name,
    description="Environment created from a Docker context.",
)
ml_client.environments.create_or_update(env_docker_context)

#### 5. Pick the dataset for fine-tuning the model

The [CNN DailyMail](https://huggingface.co/datasets/cnn_dailymail) dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. It is larger than 1GB when uncompressed. 

We want this sample to run quickly, so only a small fraction of the dataset is used for the fine tuning job. This means the fine tuned model will have lower accuracy and should not be put to real-world use.
* Visualize some data rows. 

In [None]:
import pandas as pd

dataset_name = "cnn_dailymail"

# Show long article text without truncating the column
pd.set_option("display.max_colwidth", 1000)
train_df = pd.read_json("./src/Finetune/cnn_daily.jsonl", lines=True)
train_df.head(10)
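
The training file is in JSON Lines format: one JSON object per line. The sketch below shows that layout using only the standard library. The two records are made-up placeholders, not rows from the dataset, and the helper names are hypothetical; the field names `article` and `highlights` match the columns referenced in the evaluation step.

```python
import json
import os
import tempfile

# Two made-up records, only to show the file layout (not real dataset rows).
records = [
    {"article": "Placeholder article text one.", "highlights": "Placeholder summary one."},
    {"article": "Placeholder article text two.", "highlights": "Placeholder summary two."},
]


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path, limit=None):
    """Read up to `limit` rows back; limit=None reads the whole file."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if limit is not None and len(rows) >= limit:
                break
            if line.strip():
                rows.append(json.loads(line))
    return rows


sample_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
write_jsonl(sample_path, records)
sample = read_jsonl(sample_path, limit=1)
print(len(sample), sorted(sample[0]))
```

Reading with a `limit` is one simple way to take a small fraction of a larger file for a quick run.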

#### 6. Finetune the T5 small model for Summarization task

We use the DeepSpeed and ONNX Runtime accelerators to improve memory and compute efficiency, which in turn reduces the training cost.

The table below details some of the parameters passed to the training job.

| Parameters/accelerators | Description |
| ----------------- | --- |
| model_name | The name of the model being fine tuned. Here we specify T5-small. |
| ort | [ONNX Runtime](https://github.com/microsoft/onnxruntime) accelerates training, with a 2x speedup in training time for SOTA models, and optimizes memory so that larger models such as GPT3 fit on a 16GB GPU that would otherwise run out of memory. |
| deepspeed | [DeepSpeed](https://github.com/microsoft/deepspeed) enables running models with billions of parameters distributed across GPUs and provides different stages for memory and compute efficiency. |
| number of epochs | 1 |
| max train samples | 10 |
| Nebula | checkpointing |
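
The training job passes `--deepspeed src/Finetune/ds_config.json`, but the contents of that file are not shown in this tutorial. The sketch below writes a small config of the kind typically used with the Hugging Face Trainer integration. The ZeRO stage and every value in it are assumptions for illustration, not the settings of this tutorial's file. The `"auto"` values ask the Trainer to fill in the matching command-line arguments.

```python
import json

# Illustrative DeepSpeed config (assumed values, not the tutorial's ds_config.json).
ds_config = {
    "fp16": {"enabled": "auto"},               # follow the --fp16 flag
    "zero_optimization": {"stage": 1},         # ZeRO stage 1 shards optimizer state
    "train_micro_batch_size_per_gpu": "auto",  # follow --per_device_train_batch_size
    "gradient_accumulation_steps": "auto",
}

config_text = json.dumps(ds_config, indent=2)
print(config_text)
```

Higher ZeRO stages trade more communication for lower memory use per GPU, which is the "different stages" trade-off mentioned in the table.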


In [8]:
from azure.ai.ml import command, Input, Output
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

from azure.ai.ml.entities import (
    VsCodeJobService,
    TensorBoardJobService,
    JupyterLabJobService,
)

job = command(
    code=".",
    command="python src/Finetune/train_summarization_deepspeed_optum.py \
        --deepspeed src/Finetune/ds_config.json \
        --model_name_or_path t5-small \
        --dataset_name cnn_dailymail \
        --max_train_samples=10 \
        --max_eval_samples=10 \
        --dataset_config '3.0.0' \
        --do_train \
        --num_train_epochs=1 \
        --per_device_train_batch_size=16 \
        --per_device_eval_batch_size=16  \
        --output_dir outputs \
        --overwrite_output_dir \
        --fp16 \
        --optim adamw_ort_fused",
    compute=compute_cluster,  # the GPU cluster fetched or created in step 3
    services={
      "My_jupyterlab": JupyterLabJobService(
        nodes="all" # For distributed jobs, use the `nodes` property to pick which node you want to enable interactive services on. If `nodes` are not selected, by default, interactive applications are only enabled on the head node. Values are "all", or compute node index (for ex. "0", "1" etc.)
      ),
      "My_vscode": VsCodeJobService(
        nodes="all"
      ),
      "My_tensorboard": TensorBoardJobService(
        nodes="all",
        log_dir="outputs/runs"  # relative path of Tensorboard logs (same as in your training script)         
      ),
    },
    environment="MSBuildLab110_env@latest",
    instance_count=1,  
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": 8,
    },
)

job = ml_client.jobs.create_or_update(job)
job.studio_url
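
With `instance_count=1` and `process_count_per_instance=8`, the job launches eight training processes on one node, one per GPU. The arithmetic below shows how those settings determine the global batch size per optimizer step. It assumes no gradient accumulation, which the command above does not set; the function name is hypothetical.

```python
def global_batch_size(instance_count, processes_per_instance, per_device_batch, grad_accum_steps=1):
    """Samples consumed per optimizer step across all workers."""
    world_size = instance_count * processes_per_instance
    return world_size * per_device_batch * grad_accum_steps


# Values taken from the job definition above.
size = global_batch_size(instance_count=1, processes_per_instance=8, per_device_batch=16)
print(size)  # 128
```

Since `--max_train_samples=10` is smaller than one global batch of 128, this demonstration run has very little work per epoch, which is why it finishes quickly.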


#### Results show a **300%** improvement for the fine-tuning job with 100 epochs on the CNN_Daily dataset when using ORT, DeepSpeed and Nebula checkpointing

![Image](assets/Performance_100epoch.PNG)





![Image](assets/Noaccelarator.PNG)





![Image](assets/dsandort.PNG)


#### 7. Register the fine tuned model with the workspace
**NOTE: STEP 6 FINE-TUNE JOB MUST COMPLETE BEFORE RUNNING THIS CELL**

We will register the model from the output of the fine tuning job. This will track lineage between the fine tuned model and the fine tuning job. The fine tuning job, further, tracks lineage to the foundation model, data and training code.

In [None]:
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes
import time

timestamp = str(int(time.time()))
model_name = "T5Model"

#MLFlow model registration
mlflow_modelpath = "azureml://jobs/{jobname}/outputs/artifacts/outputs/mlflow".format(jobname = job.name)
cloud_model = Model(
    path=mlflow_modelpath,
    name=model_name+"_mlflow",
    type=AssetTypes.MLFLOW_MODEL,
    description="Model created from cloud path.",
    version=timestamp,
)
ml_client.models.create_or_update(cloud_model)

#### 8. Model Evaluation
The goal of evaluating a model is to compare its performance on a variety of metrics. `text-summarization` is a generic task type that can be used for scenarios such as abstractive and extractive summarization.

We will create a job that uses the `model_evaluation_pipeline` component and submit it for the registered model.

Note that the metrics that the evaluation job calculates in this sample are **rouge1, rouge2, rougeL and rougeLsum**.
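
To make the metric names concrete: ROUGE-N scores the overlap of n-grams between a generated summary and the reference summary. The sketch below computes a simplified ROUGE-1 F1 using lowercasing and whitespace tokenization only. It is an illustration of the idea, not the evaluation component's implementation, which is not shown here and may differ in details such as tokenization and stemming.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap with clipped counts."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 3))  # 0.833
```

Five of the six tokens match in each direction, so precision and recall are both 5/6. rouge2 applies the same idea to pairs of adjacent tokens, and rougeL uses the longest common subsequence instead of fixed-size n-grams.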


##### 8.1 Fetch the prebuilt fine tuning model evaluation component

In [None]:
from azure.ai.ml import Input, MLClient
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.dsl import pipeline

test_data = "src/Finetune/small_test-inference.jsonl"

# fetch the pipeline component from the "azureml" system registry
registry = "azureml"
subscription_id = ml_client.subscription_id
resource_group = ml_client.resource_group_name

registry_ml_client = MLClient(
    credential, subscription_id, resource_group, registry_name=registry
)

pipeline_component_func = registry_ml_client.components.get(
    name="model_evaluation_pipeline", label="latest"
)
model = ml_client.models.get(name=model_name+"_mlflow", version = timestamp)

# define the pipeline job
@pipeline()
def evaluation_pipeline(mlflow_model):
    evaluation_job = pipeline_component_func(
        # specify the foundation model available in the azureml system registry or a model from the workspace
        # mlflow_model = Input(type=AssetTypes.MLFLOW_MODEL, path=f"{mlflow_model_path}"),
        mlflow_model=mlflow_model,
        # test data
        test_data=Input(type=AssetTypes.URI_FILE, path=test_data),
        # The following parameters map to the dataset fields
        input_column_names="article",
        label_column_name="highlights",
        # Evaluation settings
        task="text-summarization",
        # config file containing the details of evaluation metrics to calculate
        # evaluation_config=Input(type=AssetTypes.URI_FILE, path="eval-config.json"),
        # config cluster/device job is running on
        # set device to GPU/CPU on basis if GPU count was found
        device="gpu",
    )
    return {"evaluation_result": evaluation_job.outputs.evaluation_result}

# submit the pipeline job for each model that we want to evaluate
# you could consider submitting the pipeline jobs in parallel, provided your cluster has multiple nodes

pipeline_jobs = []


pipeline_object = evaluation_pipeline(
    mlflow_model=Input(type=AssetTypes.MLFLOW_MODEL, path=f"{model.id}"),
)
# don't reuse cached results from previous jobs
pipeline_object.settings.force_rerun = True
pipeline_object.settings.default_compute = compute_cluster
pipeline_object.display_name = f"eval-{model.name}-{timestamp}"
pipeline_job = ml_client.jobs.create_or_update(
    pipeline_object, experiment_name=experiment_name
)
# add model['name'] and pipeline_job.name as key value pairs to a dictionary
pipeline_jobs.append({"model_name": model.name, "job_name": pipeline_job.name})
# wait for the pipeline job to complete
ml_client.jobs.stream(pipeline_job.name)



##### 8.2 Review metrics.
Viewing the job in AzureML studio is the best way to analyze the logs, metrics and outputs of jobs. You can create custom charts and compare metrics across different jobs.

![Image](assets/modelevaluation.PNG)

#### 9. Operationalizing the model

##### 9.1 Register the ONNX model

In [None]:
timestamp = str(int(time.time()))
model_name = "T5Model"

#Onnx model registration
modelpath = "azureml://jobs/{jobname}/outputs/artifacts/outputs/onnx".format(jobname = job.name)
cloud_model = Model(
    path=modelpath,
    name=model_name+"_onnx",
    type=AssetTypes.CUSTOM_MODEL,
    description="Model created from cloud path.",
    version=timestamp,
)
ml_client.models.create_or_update(cloud_model)

##### 9.2 Create online endpoint
Online endpoints give a durable REST API that can be used to integrate with applications that need to use the model.

In [None]:
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    Model,
    Environment,
    CodeConfiguration,
)
# Define an endpoint name. Names must be unique within an Azure region,
# so this example derives one from the current time.
import datetime

endpoint_name = "endpt-" + datetime.datetime.now().strftime("%m%d%H%M%f")

# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=endpoint_name,
    description="this is an endpoint for the T5 summarization model",
    auth_mode="key",
)

# Block until the endpoint has been created
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

##### 9.3 Deploy scoring file to the endpoint

In [None]:
env = Environment(
    image="mcr.microsoft.com/azureml/curated/acpt-t5:latest",
)

model = ml_client.models.get(name=model_name+"_onnx", version = timestamp)

blue_deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name=endpoint_name,
    model=model,
    environment=env,
    code_configuration=CodeConfiguration(
        code=".", scoring_script="src/Operationalize/score_onnx.py"
    ),
    instance_type="Standard_F8s_v2",
    instance_count=1,
)

# Block until the deployment finishes; step 9.4 requires it to be complete
ml_client.online_deployments.begin_create_or_update(blue_deployment).result()

#### Aside: Scoring files for ONNX Runtime Inference vs. Hugging Face Inference

![Image](assets/T5_beamsearch.PNG)

In [None]:
import json 
import numpy as np
from onnxruntime import InferenceSession
import os
import time
from transformers import AutoTokenizer

# Documentation: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-online-endpoints
# Troubleshooting: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-troubleshoot-online-endpoints
  
# The init() method is called once, when the web service starts up.
def init():  
    global SESS
    global TOKENIZER
    # The AZUREML_MODEL_DIR environment variable indicates  
    # a directory containing the model file you registered.  
    # model_filename = os.path.join(os.environ['AZUREML_MODEL_DIR'], "onnx/outputs_beam_search.onnx")  

    model_filename = "src/Model/onnx/outputs_beam_search.onnx" 
    SESS = InferenceSession(model_filename, providers=["CPUExecutionProvider"])

    TOKENIZER = AutoTokenizer.from_pretrained("t5-small")
  
# The run() method is called each time a request is made to the scoring API.  
def run(data):
    json_data = json.loads(data)
    input_data = json_data["inputs"]["article"]
    
    # Request NumPy tensors so this script does not depend on PyTorch
    input_ids = TOKENIZER(str(input_data), return_tensors="np").input_ids

    ort_inputs = {
        "input_ids": np.array(input_ids, dtype=np.int32),
        "max_length": np.array([512], dtype=np.int32),
        "min_length": np.array([0], dtype=np.int32),
        "num_beams": np.array([1], dtype=np.int32),
        "num_return_sequences": np.array([1], dtype=np.int32),
        "length_penalty": np.array([1.0], dtype=np.float32),
        "repetition_penalty": np.array([1.0], dtype=np.float32)
    }
    
    # SESS.run returns a list of outputs: take the first output, then the 0th batch entry
    out = SESS.run(None, ort_inputs)[0][0]

    summary = TOKENIZER.decode(out[0], skip_special_tokens=True)

    # You can return any JSON-serializable object.
    return {"summary": summary}

def test():
    # NOTE: You need to comment out model_filename = os.path.join(...) in init() for local testing
    init()
    payload = {
        "inputs": {
            "article": ["summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."],
            "params": {
                "max_new_tokens": 512
            }
        }
    }
    payload = str.encode(json.dumps(payload))
    res = run(payload)
    print(res)

    # timed run
    start = time.time()
    for i in range(10):
        _ = run(payload)
    diff = time.time() - start
    print(f"time {diff/10} sec")

test()


In [None]:
import json
import os
import time
from transformers import AutoConfig, AutoModelForSeq2SeqLM, AutoTokenizer

# Documentation: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-online-endpoints
# Troubleshooting: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-troubleshoot-online-endpoints
  
# The init() method is called once, when the web service starts up.
def init():  
    global MODEL
    global TOKENIZER
    # The AZUREML_MODEL_DIR environment variable indicates  
    # a directory containing the model file you registered.  
    # model_path = os.path.join(os.environ['AZUREML_MODEL_DIR'])
    # model_file = os.path.join(os.environ['AZUREML_MODEL_DIR'], "pytorch_model.bin")

    model_path = "src/Model"
    model_file = "src/Model/pytorch_model.bin"
    TOKENIZER = AutoTokenizer.from_pretrained(model_path)
    config = AutoConfig.from_pretrained(model_path)
    MODEL = AutoModelForSeq2SeqLM.from_pretrained(model_file, config=config) 
    
  
# The run() method is called each time a request is made to the scoring API.  
def run(data):
    json_data = json.loads(data)
    input_data = json_data["inputs"]["article"]
    inputs = TOKENIZER(str(input_data), return_tensors="pt").input_ids

    out = MODEL.generate(inputs, max_new_tokens=512, do_sample=False)

    summary = TOKENIZER.decode(out[0], skip_special_tokens=True)
      
    # You can return any JSON-serializable object.  
    return {"summary": summary}

    
def test():
    # NOTE: You need to comment out model_file/path = os.path.join(...) in init() for local testing
    init()
    payload = {
        "inputs": {
            "article": ["summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."],
            "params": {
                "max_new_tokens": 512
            }
        }
    }
    payload = str.encode(json.dumps(payload))
    res = run(payload)
    print(res)
    
    # timed run
    start = time.time()
    for i in range(10):
        _ = run(payload)
    diff = time.time() - start
    print(f"time {diff/10} sec")

test()
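
Both scoring scripts follow the same contract: `init()` runs once and loads state into globals, and `run()` receives the raw request body and returns a JSON-serializable object. The sketch below isolates that contract with a stand-in summarizer so it can be exercised without any model files. `fake_summarize` is a hypothetical placeholder, not part of either script, and unlike the scripts above this version handles each element of the `article` list separately.

```python
import json

SUMMARIZER = None


def fake_summarize(text: str) -> str:
    """Placeholder for a real model: returns the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."


def init():
    # Called once at startup; a real script would load the model here.
    global SUMMARIZER
    SUMMARIZER = fake_summarize


def run(data):
    # Called per request with the raw JSON body.
    articles = json.loads(data)["inputs"]["article"]
    return {"summary": [SUMMARIZER(article) for article in articles]}


init()
body = json.dumps({"inputs": {"article": ["summarize: First point. Second point."]}})
print(run(body))
```

Keeping model loading in `init()` means the per-request cost in `run()` is limited to tokenization and generation, which is what the timed loops in the two scripts measure.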

##### 9.4: Invoke the endpoint to score data by using your model
**NOTE: STEP 9.3 ENDPOINT DEPLOYMENT MUST COMPLETE BEFORE RUNNING THIS CELL**

Test the blue deployment with some sample data.


In [None]:
ml_client.online_endpoints.invoke(
    endpoint_name=endpoint_name,
    deployment_name="blue",
    request_file="src/Operationalize/payload.json",
)
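
The SDK call above is convenient inside a notebook. Applications usually call the endpoint's REST scoring URI directly, passing the endpoint key as a bearer token. The sketch below only builds the request and does not send it. The URI and key are placeholders (in practice they come from the endpoint's details in the workspace), the function name is hypothetical, and the payload shape mirrors the one used by the local tests in the scoring scripts above.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri: str, api_key: str, article: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for a key-authenticated endpoint."""
    payload = {"inputs": {"article": ["summarize: " + article]}}
    return urllib.request.Request(
        scoring_uri,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


# Placeholder values for illustration only.
request = build_scoring_request("https://example.invalid/score", "<endpoint-key>", "Some news text.")
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request).read()
```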

##### 9.5: Delete the online endpoint
Don't forget to delete the online endpoint. Otherwise the billing meter keeps running for the compute used by the endpoint.

In [None]:
ml_client.online_endpoints.begin_delete(name=endpoint_name).wait()