## Fine Tuning T5 model with Azure ML using Azure Container for PyTorch 

This tutorial shows how to fine tune the T5 model to generate a summary of a news article. We then deploy it to an online endpoint for real time inference. The model is trained on a tiny sample of the dataset with a small number of epochs to illustrate the fine tuning approach.

### Learning Objectives
- Fine tune the T5 small model for the `Summarization` task with `Azure ML`
- Leverage the `ACPT` environment with state-of-the-art accelerators
- Increase training efficiency using [`DeepSpeed`](https://github.com/microsoft/DeepSpeed) and [`ONNX Runtime`](https://github.com/microsoft/onnxruntime)
- Model evaluation using a prebuilt component
- Register the model with AzureML
- Deploy and run inference using MIR and ONNX Runtime


###### T5-small is a 60-million-parameter model built on the text-to-text framework and pretrained on the Colossal Clean Crawled Corpus (C4) dataset. It is used for several NLP tasks, including machine translation, document summarization, and question answering.

The image below color-codes example tasks: translation (green), linguistic acceptability (red), sentence similarity (yellow), and document summarization (blue).

##### In this workshop, we fine tune the model for the document summarization task.


![Image](assets/t5modelcard.PNG)

#### 1. Prerequisites to install Azure ML Python SDK Version 2 
Restart the kernel after the pip installs so the environment picks up the new modules.

In [42]:
%pip install azure-ai-ml azure-identity datasets azure-cli
%pip install onnxruntime==1.15.1 transformers==4.29.2 torch==2.0

Note: you may need to restart the kernel to use updated packages.
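
After restarting the kernel, it can help to confirm that the pinned packages were actually picked up. The following is a minimal sketch using only the standard library; the helper name `check_pins` is hypothetical, and the loose prefix comparison (so that a build such as `2.0.1+cu117` counts as matching `2.0`) is an assumption, not what `pip` itself enforces.

```python
from importlib import metadata

# Pins copied from the %pip cell above.
PINS = {"onnxruntime": "1.15.1", "transformers": "4.29.2", "torch": "2.0"}


def check_pins(pins):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Loose prefix match: "2.0.1+cu117" is treated as satisfying "2.0".
        ok = installed is not None and installed.startswith(wanted)
        report[name] = (installed, ok)
    return report


for name, (installed, ok) in check_pins(PINS).items():
    print(f"{name}: installed={installed} matches_pin={ok}")
```

A package reported as `installed=None` was not found in the active environment, which usually means the kernel still needs a restart or the install targeted a different interpreter.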


In [43]:
# !az login --use-device-code

#### 2. Connect to Azure Machine Learning workspace

Before we dive into the code, you'll need to connect to your workspace. The workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning.

For this lab, we've already set up an AzureML Workspace for you. If you'd like to learn more about `Workspace`s, please reference [`AzureML's documentation`](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace?view=azureml-api-2&tabs=azure-portal).

We use `AzureCliCredential` to get access to the workspace; it reuses the identity from an earlier `az login`. `DefaultAzureCredential` is an alternative that should be capable of handling most scenarios. To learn more about the available credentials, go to [`Set up authentication`](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-setup-authentication?tabs=sdk&view=azureml-api-2).

In [44]:
from azure.ai.ml import MLClient
from azure.identity import AzureCliCredential

credential = AzureCliCredential()
credential.get_token("https://management.azure.com/.default")

ml_client = MLClient(
    credential,
    subscription_id="ed2cab61-14cc-4fb3-ac23-d72609214cfd",
    resource_group_name="AMLDataCache",
    workspace_name="datacachetest",
)
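
If you would rather not hardcode the subscription, resource group, and workspace in the notebook, one option is to keep them in a small JSON file. The sketch below is an illustration only: the helper `load_workspace_config` is hypothetical, the placeholder values are not real, and the key names are assumed to follow the `config.json` layout that `MLClient.from_config` reads.

```python
import json
import os
import tempfile


def load_workspace_config(path):
    """Read subscription, resource group, and workspace name from a JSON file."""
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)
    required = ("subscription_id", "resource_group", "workspace_name")
    missing = [key for key in required if not cfg.get(key)]
    if missing:
        raise KeyError(f"missing workspace settings: {missing}")
    return {key: cfg[key] for key in required}


# Demo with placeholder values written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "config.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(
            {
                "subscription_id": "<your-subscription-id>",
                "resource_group": "<your-resource-group>",
                "workspace_name": "<your-workspace>",
            },
            f,
        )
    settings = load_workspace_config(demo_path)

print(settings["workspace_name"])
```

The returned values map onto the `MLClient` arguments used above: `subscription_id`, `resource_group_name` (from `resource_group`), and `workspace_name`.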



#### 3. Create a compute

Azure Machine Learning needs a compute resource to run a job. This resource can be single or multi-node machines with Linux or Windows OS. The following script reuses an existing cluster if one is found; otherwise it provisions an Azure Machine Learning compute with the `Standard_ND40rs_v2` SKU, which is InfiniBand-enabled with Mellanox drivers to provide higher inter-node communication bandwidth and lower latency. You can find the list of RDMA-capable sizes and more detail [here](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-hpc#rdma-capable-instances).

In [45]:
from azure.ai.ml.entities import AmlCompute

experiment_name = "T5-Summarization-news-summary"

# If you already have a gpu cluster, mention it here. Else will create a new one
compute_cluster = "v100"
try:
    compute = ml_client.compute.get(compute_cluster)
    print("successfully fetched compute:", compute.name)
except Exception as ex:
    print("failed to fetch compute:", compute_cluster)
    print("creating new Standard_ND40rs_v2 compute")
    compute = AmlCompute(
        name=compute_cluster,
        size="Standard_ND40rs_v2", # Info on Standard_ND40rs_v2 SKU: https://learn.microsoft.com/en-us/azure/virtual-machines/ndv2-series
        min_instances=1,
        max_instances=2,  # For multi node training set this to an integer value more than 1
    )
    ml_client.compute.begin_create_or_update(compute).wait()
    print("successfully created compute:", compute.name)


successfully fetched compute: v100


#### 4. Create a job environment using Azure Container for PyTorch

We create a custom environment from the existing ACPT curated environment, which bundles state-of-the-art technologies such as DeepSpeed and ONNX Runtime. You can get more detail from [Custom Environment](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-azure-container-for-pytorch-environment?view=azureml-api-2).


View the [Environments in Azure Machine Learning studio](https://ml.azure.com/environments).

In [46]:
from azure.ai.ml.entities import Environment, BuildContext
import datetime

# Define an environment name
env_name = "env-" + datetime.datetime.now().strftime("%m%d%H%M%f")

env_docker_context = Environment(
    build=BuildContext(path="src/Environment/context"),
    name=env_name,
    description="Environment created from a Docker context.",
)
ml_client.environments.create_or_update(env_docker_context)



Environment({'is_anonymous': False, 'auto_increment_version': False, 'name': 'env-07102237994041', 'description': 'Environment created from a Docker context.', 'tags': {}, 'properties': {}, 'id': '/subscriptions/ed2cab61-14cc-4fb3-ac23-d72609214cfd/resourceGroups/AMLDataCache/providers/Microsoft.MachineLearningServices/workspaces/datacachetest/environments/env-07102237994041/versions/1', 'Resource__source_path': None, 'base_path': '/bert_ort/prathikrao/onnxruntime-training-examples/T5', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7fd175a6f8b0>, 'serialize': <msrest.serialization.Serializer object at 0x7fd466b7ea90>, 'version': '1', 'latest_version': None, 'conda_file': None, 'image': None, 'build': <azure.ai.ml.entities._assets.environment.BuildContext object at 0x7fd175a6f310>, 'inference_config': None, 'os_type': 'Linux', 'arm_type': 'environment_version', 'conda_file_path': None, 'path': None, 'datastore': None, 'upload_hash': None, 'translated_cond
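
The environment name is made unique by appending a timestamp. As a small illustration of what the `"%m%d%H%M%f"` pattern produces, the sketch below wraps it in a hypothetical helper, `unique_name`. The fixed timestamp is an assumption chosen to reproduce the name in the output above; the year is not encoded in the name, so 2023 is arbitrary.

```python
import datetime


def unique_name(prefix, now=None):
    """Append month, day, hour, minute, and microseconds to a prefix."""
    now = now or datetime.datetime.now()
    return prefix + now.strftime("%m%d%H%M%f")


# Fixed timestamp: 10 July, 22:37, 994041 microseconds (year is arbitrary).
stamp = datetime.datetime(2023, 7, 10, 22, 37, 0, 994041)
print(unique_name("env-", stamp))  # env-07102237994041
```

Note that the pattern skips seconds, so uniqueness relies on the microsecond field rather than on a fully ordered timestamp.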

#### 5. Pick the dataset for fine-tuning the model

The [CNN DailyMail](https://huggingface.co/datasets/cnn_dailymail) dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. It is larger than 1GB when uncompressed. 

We want this sample to run quickly, so only a fraction of the dataset is used for the fine tuning job. This means the fine tuned model will have lower accuracy, so it should not be put to real-world use.
* Visualize some data rows. 

In [47]:
import pandas as pd
pd.set_option(
    "display.max_colwidth", 1000
)

train_df = pd.read_json("./src/Finetune/cnn_daily.jsonl", lines=True)
train_df.head()

Unnamed: 0,article,highlights,id
0,"We are well versed in the dangers of using sunbeds, with skin cancer preying on the minds of all who expose their skin to the UV rays. But experts have warned the disease is not the only health danger associated with the pursuit of an artificial year-round tan. Those compelled by a desire for bronzed skin are also at risk of catching a sexually transmitted infection - herpes. Dermatologist Dawn Marie Davies, from the Mayo Clinic in Minnesota, warns sunbed users could be putting themselves at risk of contracting the sexually transmitted infection, herpes . Genital herpes is highly contagious and spreads from one person to another via skin-to-skin contact. It is commonly passed on through sex and oral sex. Once a person is infected with the virus it can reactivate every so often to cause a new episode of painful genital herpes. There is no complete cure for the infection - rather, a sufferer will be given treatment each time they suffer a flare up. Dermatologist Dawn Marie Davies, an...","Skin cancer is not the only health danger lurking for sunbed users .\nDermatologist warns the herpes virus can thrive in the warm enviroment .\nUltraviolent light can kill bacteria, but level in tanning booths is not enough .\nHerpes is highly contagious and incurable and spreads via skin contact .",39553f1871b87d88d22bd3a1dd938b0142ced6ca
1,"From Victoria Beckham to the Duchess of Cambridge, our favourite celebrities always dress to impress. But if you thought they rolled out of bed and picked up the first thing in their wardrobe, you're mistaken. The A-listers have tried-and-tested techniques when it comes to getting dressed and they all follow simple styling hacks to flatter their figure. FEMAIL has pulled together the simple but effective celebrity-favoured styling tips you can employ to make the most of your figure. Scroll down for video . The A-listers have tried-and-tested techniques when it comes to getting dressed and they all follow simple styling hacks to flatter their figure . COMBINE NAVY AND BLACK . The fash-pack have long championed head-to-toe black, but they're shaking it up for spring by injecting another colour into their wardrobes. Navy rocked the runways, from Calvin Klein to The Row, and the likes of Victoria Beckham and Rita Ora can't get enough of it. As our favourite stars prove, combining trust...",Victoria Beckham has updated her black wardrobe with a hint of navy .\nThe Duchess of Cambridge's favourite nude shoes make legs appear leaner .\nSticking to one neutral colour palette like Kim Kardashian is flattering .,738faf68a71f062e01376b06f4e1f12507e9cdbf
2,"An American company claims to have invented a 'teabag' that can turn an ordinary pint of lager into a craft beer the same way hot water is turned into brew . The Hop Theory 'beer-bag' contains a blend of hops, fruit peels and natural spices, and promises to turn light beer into craft after just two minutes of steeping. However, despite nearly reaching its crowd-funding target, the project has been criticised by professional breweries as being misleading about what constitutes as 'craft beer'. Scroll down for video . Transformation: The Hop Theory sachets promises to turn lager into craft beer in just two minutes . Hop Theory, based in Maryland, is nearing its $25,000 target on crowd-funding site Kickstarter to produce its first blend, claiming it will turn lager into 'craft beer' in just two to four minutes. Bobby Gattuso, who founded the company as a biology student in 2013, hopes to revolutionise beer drinking. 'Craft beer excels in taste but it's expensive. Light beer is cost ...","Maryland start-up promises to turn lager into craft beer with 'tea bag'\nInfusion sachet contains a blend of hops, fruit peels and natural spices .\nAfter two minutes in a pint, the Hop Theory bag has created 'craft beer'\nCriticised by breweries for being misleading about what craft beer is .",03b0e69b52d27eb51f4aa2af9769913ad8352002
3,"A cafe in the Philippines is serving up artistic cups of coffee for customers who enjoy their beverages tailor made. The owner of the Bunny Baker Cafe in Manila etches customised caricatures into coffee froth at no extra cost to his clientele, even detailing local favourite, boxer Manny Pacquiao. Graphic artist Zach Yonzon runs the cafe with his wife and uses steamed milk and froth as the canvas upon which he creates his masterpieces which can leave happy memories for tourists. Zach Yonzon uses steamed milk and froth as the canvas upon which he creates his masterpieces . And the tools of his unique trade include a spoon and a barbecue skewer, which he dips in dark chocolate. The service started out as a simple novelty when the owner began etching rabbits, which coincide with the cafe's theme, into cups of coffee. But the idea quickly expanded when customers started asking for more intricate and complicated designs. This means tourists to the area can leave with a special memento fr...",Zach Yonzon runs the Bunny Baker Cafe with his wife in Manila .\nArtist uses a spoon and a barbecue skewer dipped in chocolate .\nCreates incredibly detailed portraits at the request of customers .\nService started out as novelty when owner began etching rabbits .\nArtistic barista hopes to one day be able to create 3D caricatures .,dc69e09212bb9ac1965ba53f7437c7a5f9193eb4
4,"A graphic on NBC's Today show on Wednesday misidentified Saturday Night Live creator Lorne Michaels as 'Lauren'. The flub by a graphics person, made on the East Coast feed of the morning show, was corrected for broadcasts in other time zones and online, the network said. Today had interviewed 70-year-old Michaels for a story Matt Lauer did on a New York gathering for people listed by TIME magazine as the 100 most influential in the world. Michaels is a legend at NBC for SNL - making the mistake by his own network even more embarrassing. Scroll down for video . Naming blunder: A graphic on NBC's Today show on Wednesday misidentified Saturday Night Live creator Lorne Michaels as 'Lauren' Michaels, a native of Toronto, made the cut for the second time since 2008. The thrice married father-of-three made up part of the Titans category on the list compiled by TIME, which included rapper Kanye West, Apple CEO Tim Cook and reality TV star Kim Kardashian. 'Hard to think of anyone more skil...","A graphic on NBC's Today show on Wednesday misidentified Saturday Night Live creator Lorne Michaels as 'Lauren'\nThe flub by a graphics person, made on the East Coast feed of the morning show, was corrected for broadcasts in other time zones and online .\nToday interviewed 70-year-old Michaels for a story on New York gathering for influential people listed by TIME magazine .",451829b5e4171ac3b54d2c4457b09855d429c33a
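
When the file is large, you may prefer to peek at a few records without loading everything into a DataFrame. The following is a minimal standard-library sketch; `read_jsonl_head` is a hypothetical helper, and the two records are made up to mirror the `article`, `highlights`, and `id` columns shown above.

```python
import io
import json
from itertools import islice


def read_jsonl_head(lines, n):
    """Parse at most n non-empty lines of JSON Lines text into dicts."""
    non_empty = (line for line in lines if line.strip())
    return [json.loads(line) for line in islice(non_empty, n)]


# Two made-up records with the same fields as the table above.
sample = io.StringIO(
    '{"article": "First article text.", "highlights": "First summary.", "id": "a1"}\n'
    '{"article": "Second, longer article text.", "highlights": "Second summary.", "id": "b2"}\n'
)
rows = read_jsonl_head(sample, n=1)
print(len(rows), sorted(rows[0]))
```

With the real data, the same helper accepts an open file handle, for example one opened on `./src/Finetune/cnn_daily.jsonl`.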


#### 6. Finetune the T5 small model for Summarization task

We leverage the DeepSpeed and ONNX Runtime accelerators to improve memory and compute efficiency, which in turn reduces the training cost.

The table below details some of the parameters passed to the training job.

| Parameters/accelerators | Description |
| ----------------- | --- |
| model_name | The name of the model being fine-tuned. Here we specify T5-small. |
| ort | [ONNX Runtime](https://github.com/microsoft/onnxruntime) accelerates training, giving a 2x speedup in training time for SOTA models, and optimizes memory to fit larger models such as GPT3 on a 16GB GPU that would otherwise run out of memory. |
| deepspeed | [DeepSpeed](https://github.com/microsoft/deepspeed) enables running models with billions of parameters distributed across GPUs and provides different stages for memory and compute efficiency. |
| number of epochs | 1 |
| max train samples | 10 |
| Nebula | checkpointing |
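
The training command passes a `ds_config.json` file to `--deepspeed`, but its contents are not shown in this notebook. The sketch below is a hypothetical example of what such a file might contain, not the tutorial's actual configuration. It assumes ZeRO stage 1 and uses the `"auto"` placeholders that the Hugging Face Trainer integration fills in from its own command-line arguments.

```python
import json

# Hypothetical contents for ds_config.json; the tutorial's actual file is not shown.
ds_config = {
    "fp16": {"enabled": "auto"},  # follow the --fp16 flag passed to the trainer
    "zero_optimization": {"stage": 1},  # ZeRO stage 1 partitions optimizer states
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
}

print(json.dumps(ds_config, indent=2))
```

Higher ZeRO stages trade more communication for lower per-GPU memory; for a model as small as T5-small, a low stage is likely sufficient.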


In [48]:
from azure.ai.ml import command

env_name = "MSBuildLab110_env@latest" # FOR DEMO

job = command(
    code="src/Finetune/",
    command="python train_summarization_optimum.py \
        --deepspeed ds_config.json \
        --model_name_or_path t5-small \
        --dataset_name cnn_dailymail \
        --max_train_samples=10 \
        --max_eval_samples=10 \
        --dataset_config '3.0.0' \
        --do_train \
        --num_train_epochs=1 \
        --per_device_train_batch_size=16 \
        --per_device_eval_batch_size=16  \
        --output_dir outputs \
        --overwrite_output_dir \
        --fp16 \
        --optim adamw_ort_fused",
    compute=compute_cluster,
    environment=env_name,
    instance_count=1,  
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": 8,
    },
)
job = ml_client.jobs.create_or_update(job)
job.studio_url


'https://ml.azure.com/runs/witty_screw_hmmsvtnqhz?wsid=/subscriptions/ed2cab61-14cc-4fb3-ac23-d72609214cfd/resourcegroups/AMLDataCache/workspaces/datacachetest&tid=72f988bf-86f1-41af-91ab-2d7cd011db47'
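
To see how the job settings combine, the sketch below works out the global batch size and the number of optimizer steps for this demo run. It assumes one process per GPU and no gradient accumulation; both helper names are hypothetical.

```python
import math


def global_batch_size(per_device_batch, processes_per_node, nodes):
    """Samples consumed per optimizer step, assuming no gradient accumulation."""
    return per_device_batch * processes_per_node * nodes


def steps_per_epoch(num_samples, batch):
    """Optimizer steps per epoch when the last partial batch is kept."""
    return math.ceil(num_samples / batch)


# Values from the job above: batch size 16, 8 processes per node, 1 node.
batch = global_batch_size(per_device_batch=16, processes_per_node=8, nodes=1)
print(batch)  # 128
print(steps_per_epoch(10, batch))  # 1
```

With only 10 training samples and a global batch of 128, the single epoch amounts to one optimizer step, which is why this demo job is useful for checking the pipeline rather than for model quality.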

#### Results show a **~3x** speedup for a 100-epoch fine-tuning job on the CNN_Daily dataset with ORT, DeepSpeed, and Nebula checkpointing

| Accelerator | Train Runtime | Train Samples Per Sec | Train Steps Per Sec | Train Loss | FLOPS          | AML Link                                                   |
| ----------- | ------------- | --------------------- | ------------------- | ---------- | -------------- | ---------------------------------------------------------- |
| -           | 1d 9h 6m 22s  | 241.782               | 0.945               | 1.704454   | 7.7719 x 10^18 | [100epoch_and_no_accelarator][100epoch_and_no_accelarator] |
| ORT+DS      | 10h 53m 36s   | 744.384               | 2.909               | 1.850196   | 7.8759 x 10^18 | [100epoch_with_DS_ORT_Nebula][100epoch_with_DS_ORT_Nebula] |

[100epoch_and_no_accelarator]: https://ml.azure.com/experiments/id/236409a2-f1d9-41da-9b7e-19ec7bfd23a6/runs/tough_cord_3wz4fymfjf?wsid=/subscriptions/ed2cab61-14cc-4fb3-ac23-d72609214cfd/resourcegroups/AMLDataCache/workspaces/datacachetest&tid=72f988bf-86f1-41af-91ab-2d7cd011db47
[100epoch_with_DS_ORT_Nebula]: https://ml.azure.com/experiments/id/236409a2-f1d9-41da-9b7e-19ec7bfd23a6/runs/tough_fennel_7n42y51slk?wsid=/subscriptions/ed2cab61-14cc-4fb3-ac23-d72609214cfd/resourcegroups/AMLDataCache/workspaces/datacachetest&tid=72f988bf-86f1-41af-91ab-2d7cd011db47
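
The speedup can be checked directly from the table. The sketch below converts the two runtimes to seconds and compares them, then does the same for throughput; `to_seconds` is a hypothetical helper written for this check.

```python
import re

UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def to_seconds(duration):
    """Convert a string such as '1d 9h 6m 22s' to seconds."""
    parts = re.findall(r"(\d+)\s*([dhms])", duration)
    return sum(int(number) * UNIT_SECONDS[unit] for number, unit in parts)


baseline = to_seconds("1d 9h 6m 22s")
accelerated = to_seconds("10h 53m 36s")
print(baseline, accelerated)  # 119182 39216
print(round(baseline / accelerated, 2))  # 3.04
print(round(744.384 / 241.782, 2))  # 3.08
```

Both ratios land near 3, so the accelerated run finishes in roughly a third of the baseline time.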

#### 7. Operationalizing the model

##### 7.1 Register ONNX model
**NOTE: STEP 6 (T5 FINETUNE) MUST COMPLETE BEFORE RUNNING THIS CELL**

In [49]:
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Model
import time

timestamp = str(int(time.time()))
model_name = "T5Model"
job.name = "loyal_nose_p0r4vzq8w8" # FOR DEMO

#Onnx model registration
modelpath = "azureml://jobs/{jobname}/outputs/artifacts/outputs/onnx".format(jobname = job.name)
cloud_model = Model(
    path=modelpath,
    name=model_name+"_onnx",
    type=AssetTypes.CUSTOM_MODEL,
    description="Model created from cloud path.",
    version=timestamp,
)
ml_client.models.create_or_update(cloud_model)


Model({'job_name': 'loyal_nose_p0r4vzq8w8', 'is_anonymous': False, 'auto_increment_version': False, 'name': 'T5Model_onnx', 'description': 'Model created from cloud path.', 'tags': {}, 'properties': {}, 'id': '/subscriptions/ed2cab61-14cc-4fb3-ac23-d72609214cfd/resourceGroups/AMLDataCache/providers/Microsoft.MachineLearningServices/workspaces/datacachetest/models/T5Model_onnx/versions/1689028640', 'Resource__source_path': None, 'base_path': '/bert_ort/prathikrao/onnxruntime-training-examples/T5', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7fd1c0eb8d60>, 'serialize': <msrest.serialization.Serializer object at 0x7fd175a7a070>, 'version': '1689028640', 'latest_version': None, 'path': 'azureml://subscriptions/ed2cab61-14cc-4fb3-ac23-d72609214cfd/resourceGroups/AMLDataCache/workspaces/datacachetest/datastores/workspaceartifactstore/paths/ExperimentRun/dcid.loyal_nose_p0r4vzq8w8/outputs/onnx', 'datastore': None, 'utc_time_created': None, 'flavors': None, 'a

##### 7.2 Create online endpoint
Online endpoints give a durable REST API that can be used to integrate with applications that need to use the model.

In [50]:
from azure.ai.ml.entities import ManagedOnlineEndpoint
import datetime

# Define an endpoint name
endpoint_name = "endpt-" + datetime.datetime.now().strftime("%m%d%H%M%f")

# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name = endpoint_name, 
    description="this is a endpoint for T5 summarization model",
    auth_mode="key"
)

# Create the endpoint once and block until provisioning completes
ml_client.online_endpoints.begin_create_or_update(endpoint).result()


ManagedOnlineEndpoint({'public_network_access': 'Enabled', 'provisioning_state': 'Succeeded', 'scoring_uri': 'https://endpt-07102237737235.eastus.inference.ml.azure.com/score', 'openapi_uri': 'https://endpt-07102237737235.eastus.inference.ml.azure.com/swagger.json', 'name': 'endpt-07102237737235', 'description': 'this is a endpoint for T5 summarization model', 'tags': {}, 'properties': {'azureml.onlineendpointid': '/subscriptions/ed2cab61-14cc-4fb3-ac23-d72609214cfd/resourcegroups/amldatacache/providers/microsoft.machinelearningservices/workspaces/datacachetest/onlineendpoints/endpt-07102237737235', 'AzureAsyncOperationUri': 'https://management.azure.com/subscriptions/ed2cab61-14cc-4fb3-ac23-d72609214cfd/providers/Microsoft.MachineLearningServices/locations/eastus/mfeOperationsStatus/oe:3f9aec1f-8bce-45dd-9358-bb5750b6f300:d304a2c4-1355-4c0a-8f13-d7329ff5edf4?api-version=2022-02-01-preview'}, 'id': '/subscriptions/ed2cab61-14cc-4fb3-ac23-d72609214cfd/resourceGroups/AMLDataCache/provide

##### 7.3 Deploy scoring file to the endpoint

In [51]:
from azure.ai.ml.entities import (
    CodeConfiguration,
    ManagedOnlineDeployment
)

model = ml_client.models.get(name=model_name+"_onnx", version = timestamp)

blue_deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name=endpoint_name,
    model=model,
    environment="MSBuildLab110_env@latest",
    code_configuration=CodeConfiguration(
        code=".", scoring_script="src/Operationalize/score_onnx.py"
    ),
    instance_type="Standard_F8s_v2",
    instance_count=1,
)

ml_client.online_deployments.begin_create_or_update(blue_deployment)

Check: endpoint endpt-07102237737235 exists

Your file exceeds 100 MB. If you experience low upload speeds or latency, we recommend using the AzCopy tool for this file transfer. See https://docs.microsoft.com/azure/storage/common/storage-use-azcopy-v10 for more information.
Uploading T5 (976.29 MBs): 100%|██████████| 976292071/976292071 [00:05<00:00, 164452459.11it/s]

data_collector is not a known attribute of class <class 'azure.ai.ml._restclient.v2022_02_01_preview.models._models_py3.ManagedOnlineDeployment'> and will be ignored

<azure.core.polling._poller.LROPoller at 0x7fd46457b550>

##### 7.4 Invoke the endpoint to score data by using your model
**NOTE: STEP 7.3 (ENDPOINT DEPLOYMENT) MUST COMPLETE BEFORE RUNNING THIS CELL**

Test the blue deployment with some sample data


In [52]:
endpoint_name = "MSBuildLab110-endpoint" # FOR DEMO

ml_client.online_endpoints.invoke(
    endpoint_name=endpoint_name,
    deployment_name="blue",
    request_file="src/Operationalize/payload.json",
)


'{"summary": "the Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It\'s the most aggressive action on tackling the climate crisis in American history. It\'ll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share.", "inference time (seconds)": 0.3802335262298584}'
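
Because the endpoint is a REST API with key authentication, it can also be called without the SDK. The sketch below only builds the request and does not send it. The URI and key are placeholders, `build_scoring_request` is a hypothetical helper, and the body shape is assumed to match the `inputs`/`article`/`params` layout used by the scoring script's local test. In practice the key could be fetched with `ml_client.online_endpoints.get_keys(name=endpoint_name)`.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, api_key, article, deployment="blue"):
    """Build (but do not send) a POST request for a key-authenticated endpoint."""
    body = {"inputs": {"article": [article], "params": {"max_new_tokens": 512}}}
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
        # Route to one deployment instead of using the endpoint's traffic rules.
        "azureml-model-deployment": deployment,
    }
    return urllib.request.Request(
        scoring_uri,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Placeholder URI and key; substitute the values for your own endpoint.
request = build_scoring_request(
    "https://my-endpoint.eastus.inference.ml.azure.com/score",
    "<endpoint-key>",
    "summarize: The Inflation Reduction Act lowers prescription drug costs.",
)
print(request.get_method(), json.loads(request.data)["inputs"]["params"])

# Sending it would look like: urllib.request.urlopen(request).read()

# The response is a JSON string, so it can be parsed like this made-up example.
result = json.loads('{"summary": "example summary", "inference time (seconds)": 0.38}')
print(result["summary"])
```

Keeping request construction separate from sending makes it easy to inspect the headers and body before any billable call is made.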

##### 7.5 Delete the online endpoint
Don't forget to delete the online endpoint; otherwise, the billing meter keeps running for the compute used by the endpoint.

In [53]:
ml_client.online_endpoints.begin_delete(name=endpoint_name)


#### Aside: Scoring files for ONNX Runtime Inference vs. Hugging Face Inference

![Image](assets/T5_beamsearch.png)

In [None]:
import json 
import numpy as np
import onnxruntime
onnxruntime.set_default_logger_severity(3)
import os
import time
from transformers import AutoTokenizer

# Documentation: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-online-endpoints
# Troubleshooting: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-troubleshoot-online-endpoints
  
# The init() method is called once, when the web service starts up.
def init():  
    global SESS
    global TOKENIZER
    # The AZUREML_MODEL_DIR environment variable indicates  
    # a directory containing the model file you registered.  
    # model_filename = os.path.join(os.environ['AZUREML_MODEL_DIR'], "onnx/outputs_beam_search.onnx")  

    model_filename = "src/Model/onnx/outputs_beam_search.onnx" 
    SESS = onnxruntime.InferenceSession(model_filename, providers=["CPUExecutionProvider"])

    TOKENIZER = AutoTokenizer.from_pretrained("t5-small")
  
# The run() method is called each time a request is made to the scoring API.  
def run(data):
    json_data = json.loads(data)
    input_data = json_data["inputs"]["article"]
    
    input_ids = TOKENIZER(str(input_data), return_tensors="pt").input_ids

    ort_inputs = {
        "input_ids": np.array(input_ids, dtype=np.int32),
        "max_length": np.array([512], dtype=np.int32),
        "min_length": np.array([0], dtype=np.int32),
        "num_beams": np.array([1], dtype=np.int32),
        "num_return_sequences": np.array([1], dtype=np.int32),
        "length_penalty": np.array([1.0], dtype=np.float32),
        "repetition_penalty": np.array([1.0], dtype=np.float32)
    }
    
    out = SESS.run(None, ort_inputs)[0][0]  # first output, 0th batch

    summary = TOKENIZER.decode(out[0], skip_special_tokens=True)  # 0th returned sequence

    # You can return any JSON-serializable object.
    return {"summary": summary}

def test():
    # NOTE: You need to comment out model_filename = os.path.join(...) in init() for local testing
    init()
    payload = {
        "inputs": {
            "article": ["summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."],
            "params": {
                "max_new_tokens": 512
            }
        }
    }
    payload = str.encode(json.dumps(payload))
    res = run(payload)
    print(res)

    # timed run
    start = time.time()
    for i in range(10):
        _ = run(payload)
    diff = time.time() - start
    print(f"time {diff/10} sec")

test()


{'summary': "the Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share."}
time 0.41439123153686525 sec


In [None]:
import json
import os  # needed when switching init() to AZUREML_MODEL_DIR
import time
from transformers import AutoConfig, AutoModelForSeq2SeqLM, AutoTokenizer

# Documentation: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-online-endpoints
# Troubleshooting: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-troubleshoot-online-endpoints
  
# The init() method is called once, when the web service starts up.
def init():  
    global MODEL
    global TOKENIZER
    # The AZUREML_MODEL_DIR environment variable indicates  
    # a directory containing the model file you registered.  
    # model_path = os.path.join(os.environ['AZUREML_MODEL_DIR'])
    # model_file = os.path.join(os.environ['AZUREML_MODEL_DIR'], "pytorch_model.bin")

    model_path = "src/Model"
    model_file = "src/Model/pytorch_model.bin"
    TOKENIZER = AutoTokenizer.from_pretrained(model_path)
    config = AutoConfig.from_pretrained(model_path)
    MODEL = AutoModelForSeq2SeqLM.from_pretrained(model_file, config=config) 
    
  
# The run() method is called each time a request is made to the scoring API.  
def run(data):
    json_data = json.loads(data)
    input_data = json_data["inputs"]["article"]
    inputs = TOKENIZER(str(input_data), return_tensors="pt").input_ids

    out = MODEL.generate(inputs, max_new_tokens=512, do_sample=False)

    summary = TOKENIZER.decode(out[0], skip_special_tokens=True)
      
    # You can return any JSON-serializable object.  
    return {"summary": summary}

    
def test():
    # NOTE: You need to comment out model_file/path = os.path.join(...) in init() for local testing
    init()
    payload = {
        "inputs": {
            "article": ["summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."],
            "params": {
                "max_new_tokens": 512
            }
        }
    }
    payload = str.encode(json.dumps(payload))
    res = run(payload)
    print(res)
    
    # timed run
    start = time.time()
    for i in range(10):
        _ = run(payload)
    diff = time.time() - start
    print(f"time {diff/10} sec")

test()

{'summary': "the Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share."}
time 1.7115404605865479 sec
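
The two scripts time ten calls each and print the average. A slightly more careful version of that measurement is sketched below, with a warmup call and `time.perf_counter`; `mean_latency` is a hypothetical helper and the workload is a stand-in so the sketch runs without a model. The final lines compare the two averages printed above, which come from single runs and so should be read as indicative only.

```python
import time


def mean_latency(fn, arg, runs=10, warmup=1):
    """Average seconds per call over `runs` calls, after `warmup` untimed calls."""
    for _ in range(warmup):
        fn(arg)
    start = time.perf_counter()
    for _ in range(runs):
        fn(arg)
    return (time.perf_counter() - start) / runs


# Stand-in workload so the sketch runs without a model.
latency = mean_latency(lambda text: text.upper(), "summarize: some article", runs=100)
print(f"{latency:.6f} sec per call")

# Ratio of the two averages printed by the scoring scripts above.
speedup = 1.7115404605865479 / 0.41439123153686525
print(round(speedup, 2))  # 4.13
```

On these numbers, ONNX Runtime inference is roughly 4x faster per request than the Hugging Face `generate` call for the same input.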

