Copyright (C) 2022 Intel Corporation
 
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
 
http://www.apache.org/licenses/LICENSE-2.0
 
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions
and limitations under the License.
 

SPDX-License-Identifier: Apache-2.0

# General Description

Version: 1.0
Date: Sep 20, 2022

This notebook outlines quantized (INT8) NLP inference on Intel CPUs, using PyTorch with IPEX (Intel Extension for PyTorch) optimization, Intel Neural Compressor, and a HuggingFace model on the Azure Machine Learning platform. The trained BERT base model is quantized by Intel Neural Compressor and converted into ONNX format.

Users may wish to reuse parts of the code and customize them to suit their purposes.

# Prerequisite
Log in to Azure: open a terminal/console and run the command below, then follow the instructions shown in the terminal to complete interactive authentication.

Command:
'az login'
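
To confirm from Python that a login session already exists, the check below may help. It is a minimal sketch, not part of the original notebook: it assumes the Azure CLI is on PATH and relies on 'az account show' returning a non-zero exit code when no account is logged in. The function name azure_cli_logged_in and its runner parameter are hypothetical helpers introduced for illustration.

```python
import subprocess


def azure_cli_logged_in(runner=subprocess.run):
    """Return True if `az account show` succeeds, i.e. a login session exists.

    `runner` is injectable so the check can be exercised without the Azure CLI.
    """
    try:
        completed = runner(
            ["az", "account", "show"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        # The Azure CLI is not installed or not on PATH.
        return False
    return completed.returncode == 0
```

A notebook cell could call azure_cli_logged_in() and print a reminder to run 'az login' when it returns False.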

# Step 1: Create/Load the Azure Machine Learning workspace

In [None]:
from azureml.core import Workspace

try:
    ws = Workspace.from_config('./config.json')
    print('Loaded existing workspace configuration')
except Exception:
    # No usable config.json was found: create the workspace and save its configuration.
    ws = Workspace.create(name='intel_azureml_ws',
            subscription_id='----USER AZURE SUBSCRIPTION ID----',  # Please fill in the Azure subscription ID
            resource_group='intel_azureml_resource',
            create_resource_group=True,
            location='westus2'
            )
    ws.write_config(path="./", file_name="config.json")
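
Before loading the workspace, it can be useful to check that config.json is complete. The sketch below assumes the file is a JSON object with the keys subscription_id, resource_group and workspace_name, and that unfilled values follow the '----...----' placeholder style used above. The helper name config_problems is hypothetical.

```python
import json
from pathlib import Path

# Keys assumed to be present in a workspace config.json.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def config_problems(path="./config.json"):
    """Return a list of human-readable problems found in a workspace config file."""
    config_file = Path(path)
    if not config_file.is_file():
        return [f"{path} does not exist"]
    try:
        config = json.loads(config_file.read_text())
    except json.JSONDecodeError as err:
        return [f"{path} is not valid JSON: {err}"]
    if not isinstance(config, dict):
        return [f"{path} does not contain a JSON object"]

    problems = []
    for key in REQUIRED_KEYS:
        value = str(config.get(key, "")).strip()
        if not value:
            problems.append(f"missing value for '{key}'")
        elif value.startswith("----"):
            # Matches the '----USER AZURE SUBSCRIPTION ID----' placeholder style.
            problems.append(f"'{key}' still holds a placeholder")
    return problems
```

An empty list means the file looks usable; otherwise each entry names one thing to fix before calling Workspace.from_config.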

# Step 2: Prepare the materials for quantizing trained HuggingFace model using Intel Neural Compressor
To quantize the trained HuggingFace model, users have to prepare two items:
1. INC config file
2. Trained HuggingFace PyTorch model

For the INC config file, we provide ../src/inference_container/config/ptq.yaml. It specifies the operation (post-training quantization) to be performed by Intel Neural Compressor.

The trained model should have been downloaded automatically to './fp32_model_output' in the previous steps. If that is not the case, download the trained HuggingFace PyTorch model from the Azure Machine Learning web page: 'work_space_name' -> 'Jobs' -> 'the_jobs_id' -> 'Outputs + logs' -> 'outputs' -> 'trained_model'

The config directory and the trained model directory are uploaded by the following code.

In [None]:
from azureml.core import Dataset, Environment, Experiment, ScriptRunConfig, Workspace
from azureml.data.datapath import DataPath

datastore = ws.get_default_datastore()
Dataset.File.upload_directory(src_dir='../src/inference_container/config', 
                              target=DataPath(datastore, "/inc/ptq_config"),
                              overwrite=True
                             )
Dataset.File.upload_directory(src_dir='./fp32_model_output/outputs/trained_model', 
                              target=DataPath(datastore, "/trained_fp32_hf_model"),
                              overwrite=True
                             )

remote_inc_config =  Dataset.File.from_files(path=(datastore, '/inc/ptq_config'))
remote_fp32_model = Dataset.File.from_files(path=(datastore, '/trained_fp32_hf_model'))
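
Uploading a missing or empty directory is easy to overlook, so a local check before the upload may save a failed job later. The following is a sketch only: missing_inputs is a hypothetical helper, and it checks just that the config file exists and that the model directory is non-empty, without validating the contents.

```python
from pathlib import Path


def missing_inputs(config_dir, config_filename, model_dir):
    """Return descriptions of inputs that are absent; an empty list means ready."""
    problems = []

    config_file = Path(config_dir) / config_filename
    if not config_file.is_file():
        problems.append(f"INC config file not found: {config_file}")

    model_path = Path(model_dir)
    if not model_path.is_dir():
        problems.append(f"model directory not found: {model_path}")
    elif not any(model_path.iterdir()):
        problems.append(f"model directory is empty: {model_path}")

    return problems
```

With the paths used in this notebook, the call would be missing_inputs('../src/inference_container/config', 'ptq.yaml', './fp32_model_output/outputs/trained_model').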

# Step 2b: Remove the local /fp32_model_output folder
Remove the ./fp32_model_output folder to avoid triggering an error: experiment snapshots must not exceed 300 MB.

Details of the error are described on the following web page:
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-save-write-experiment-files#limits

In [None]:
import shutil
shutil.rmtree('./fp32_model_output')
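
To see how close a directory is to the snapshot limit before submitting a job, its size can be measured locally. This is a minimal sketch under stated assumptions: the 300 MB limit is interpreted here as 300 * 1024 * 1024 bytes, symbolic links are skipped, and the helper names are hypothetical.

```python
import os

# Assumption: the documented 300 MB limit is treated as 300 MiB.
SNAPSHOT_LIMIT_BYTES = 300 * 1024 * 1024


def directory_size_bytes(root):
    """Sum the sizes of all regular files under `root`, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_path = os.path.join(dirpath, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def exceeds_snapshot_limit(root, limit_bytes=SNAPSHOT_LIMIT_BYTES):
    """Return True if the directory is larger than the snapshot limit."""
    return directory_size_bytes(root) > limit_bytes
```

Calling exceeds_snapshot_limit on the directory about to be snapshotted gives an early warning instead of a failed submission.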

# Step 3: Start quantizing the model
Set up the cluster and environment to launch a quantization job. To change the quantization settings or use more Intel Neural Compressor features (e.g., distillation, pruning), users may wish to modify inc_quantization.py and the related configuration file (i.e., ptq.yaml).

Note: A single node is sufficient for the quantization process.

In [None]:
from azureml.core import Environment, Experiment, ScriptRunConfig
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Provision (or reuse) a single node for quantization
cpu_cluster_name = "cpuCluster1xD64DSV4"
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # Ddsv4-series VMs run on the 3rd Generation Intel® Xeon® Platinum 8370C (Ice Lake) or the Intel® Xeon® Platinum 8272CL (Cascade Lake).
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D64DS_V4', max_nodes=1)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)
cpu_cluster.wait_for_completion(show_output=True)

#initiate the environment for quantization
azure_ddp_ipex_hf_environment = Environment.get(ws, 'azure_ddp_ipex_hf_environment')
azure_ddp_ipex_hf_environment.python.user_managed_dependencies=True

#Setup the parameters for quantization
script_params = [
    '--inc_config_path',
    remote_inc_config.as_named_input('inc_config').as_mount(),
    '--inc_config_filename',
    'ptq.yaml',
    '--fp32_model_path',
    remote_fp32_model.as_named_input('fp32_hf_model_path').as_mount(),
    '--model_name',
    'bert-base-uncased'
]

run_config = ScriptRunConfig(
  source_directory= '../src/inference_container',
  script='inc_quantization.py',
  compute_target=cpu_cluster,
  environment=azure_ddp_ipex_hf_environment,
  arguments = script_params
)

# submit the run configuration to start the job
run = Experiment(ws, "INC_PTQ").submit(run_config)
run.wait_for_completion(show_output=True)
run.download_files(output_directory='quantized_model')
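
The four arguments passed in script_params have to be read by the job script. The sketch below shows one way a script could parse them with argparse; it is not the actual inc_quantization.py, and the defaults and the helper names parse_quantization_args and config_file_path are assumptions made for illustration.

```python
import argparse
import os


def parse_quantization_args(argv=None):
    """Parse the arguments that the notebook passes via script_params."""
    parser = argparse.ArgumentParser(description="Quantization job arguments (sketch)")
    parser.add_argument("--inc_config_path", required=True,
                        help="Directory where the INC config dataset is mounted")
    parser.add_argument("--inc_config_filename", default="ptq.yaml",  # assumed default
                        help="Name of the INC config file inside that directory")
    parser.add_argument("--fp32_model_path", required=True,
                        help="Directory where the trained FP32 model is mounted")
    parser.add_argument("--model_name", default="bert-base-uncased",  # assumed default
                        help="HuggingFace model name")
    return parser.parse_args(argv)


def config_file_path(args):
    """Join the mounted config directory and the config filename."""
    return os.path.join(args.inc_config_path, args.inc_config_filename)
```

Because the datasets are passed with as_mount(), the script sees them as ordinary directory paths at run time.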

# Step 4: Deploy the model 
Deployment involves several steps:
1. Locate the quantized model downloaded to the 'quantized_model/outputs' folder. If the directory does not exist, use the Azure Machine Learning web page to download the model to a local directory: 'work_space_name' -> 'Jobs' -> 'the_jobs_id' -> 'Outputs + logs' -> 'outputs'.
2. Register the quantized model with the Azure ML platform.
3. Implement a scoring script (score_hf.py in this notebook) that defines the data preprocessing and post-processing at the endpoint, as well as how the model performs inference.
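
An Azure ML entry script exposes an init() function, called once when the service starts, and a run() function, called for each request. The sketch below illustrates that shape only. It is not the score_hf.py shipped with this notebook: the word-overlap classifier is a hypothetical placeholder standing in for the quantized model, and the 'result' key simply mirrors what Step 5 reads from the response.

```python
import json

model = None  # populated by init()


def _placeholder_classifier(sentence1, sentence2):
    """Hypothetical stand-in for the quantized model: a crude word-overlap rule."""
    words1 = set(sentence1.lower().split())
    words2 = set(sentence2.lower().split())
    union = words1 | words2
    overlap = len(words1 & words2) / len(union) if union else 0.0
    return "paraphrase" if overlap >= 0.5 else "not paraphrase"


def init():
    """Called once at service start-up; a real script would load the model here."""
    global model
    model = _placeholder_classifier


def run(raw_data):
    """Called per request: decode the JSON payload, classify, and return a dict."""
    try:
        payload = json.loads(raw_data)
        sentence1 = payload["sentence1"]
        sentence2 = payload["sentence2"]
    except (ValueError, KeyError, TypeError) as err:
        return {"error": f"invalid request: {err}"}
    return {"result": model(sentence1, sentence2)}
```

Returning an 'error' entry instead of raising keeps malformed requests from surfacing as unhandled exceptions at the endpoint; whether that is desirable depends on how clients handle failures.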

Inside score_hf.py, set the environment variable GOMP_CPU_AFFINITY according to the number of physical cores. The best configuration found for Standard_D16_v5 is currently the default, but users may explore different numbers of physical cores for different machines. For more information on the number of physical cores of different machines, see: https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-general
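
As an illustration of how such a setting could be derived, the sketch below builds a GOMP_CPU_AFFINITY value for a given core count. It assumes the cores to pin are numbered contiguously from 0, which may not match every machine's topology, and it also sets OMP_NUM_THREADS, which the original text does not mention. The helper names are hypothetical, and the variables would need to be set before the OpenMP runtime is loaded.

```python
import os


def gomp_affinity(num_physical_cores, first_core=0):
    """Build a GOMP_CPU_AFFINITY value such as '0-7' for a contiguous core range."""
    if num_physical_cores < 1:
        raise ValueError("num_physical_cores must be at least 1")
    last_core = first_core + num_physical_cores - 1
    if last_core == first_core:
        return str(first_core)
    return f"{first_core}-{last_core}"


def apply_thread_settings(num_physical_cores, environ=os.environ):
    """Write the affinity and thread-count variables into `environ`."""
    environ["GOMP_CPU_AFFINITY"] = gomp_affinity(num_physical_cores)
    # Assumption: one OpenMP thread per pinned core.
    environ["OMP_NUM_THREADS"] = str(num_physical_cores)
```

For example, apply_thread_settings(8) pins threads to cores 0 through 7; the right count for a given VM size is best confirmed by measurement.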

In [None]:
from azureml.core.model import InferenceConfig, Model
from azureml.core.environment import Environment
from azureml.core.compute import AksCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.core.webservice import AksWebservice


model_dir = 'quantized_model/outputs'
model = Model.register(workspace = ws,
                       model_path = model_dir,
                       model_name = "inc_ptq_bert_model_mrpc",
                       tags = {"Model": "inc_ptq_bert_model_mrpc"},
                       description = "Quantized HuggingFace Model",)

azure_ddp_ipex_hf_environment = Environment.get(ws, name='azure_ddp_ipex_hf_environment')
azure_ddp_ipex_hf_environment.python.user_managed_dependencies=True
azure_ddp_ipex_hf_environment.inferencing_stack_version = "latest"

inference_config = InferenceConfig(entry_script="../src/inference_container/score_hf.py", environment=azure_ddp_ipex_hf_environment)

# Create an AKS cluster, or reuse an existing one
try:
    inference_node = AksCompute(workspace=ws, name="infericelake2")
    print('Found existing cluster, use it.')
except ComputeTargetException:
    prov_config = AksCompute.provisioning_configuration(vm_size = "Standard_D16_v5", agent_count=3, location="westus")
    inference_node = ComputeTarget.create(workspace = ws, name = 'infericelake2', provisioning_configuration=prov_config)
    inference_node.wait_for_completion(show_output=True)

deployment_config = AksWebservice.deploy_configuration(cpu_cores=4, memory_gb=16) # Specify the resources for this deployment

service_name = 'hf-aks-1'
print("Service", service_name)
service = Model.deploy(ws, service_name, [model], inference_config=inference_config, deployment_config=deployment_config, deployment_target=inference_node, overwrite=True)
service.wait_for_deployment(True)
print(service.state)

# Step 5: Start to call an inference
Users can send an inference request to the endpoint through the 'service' object.

In [None]:
import json
#Input data - MRPC
sentence1 = "Shares of Genentech, a much larger company with several products on the market, rose more than 2 percent."
sentence2 = "Shares of Xoma fell 16 percent in early trade, while shares of Genentech, a much larger company with several products on the market, were up 2 percent."
input_data = json.dumps({'sentence1':sentence1, 'sentence2': sentence2})

try:
    aks_return = service.run(input_data)
    print(aks_return)
    result = aks_return['result']
    print('Classification result: ' + str(result))
except KeyError as e:
    print(str(e))
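
Depending on how the scoring script serializes its output, the response may arrive as a dict or as a JSON string. The helper below is a defensive sketch for reading the 'result' field in either case; extract_result is a hypothetical name, and returning None for unexpected shapes is a design choice rather than anything the service mandates.

```python
import json


def extract_result(response, key="result"):
    """Return response[key], decoding a JSON string first if needed; None if absent."""
    if isinstance(response, (str, bytes, bytearray)):
        try:
            response = json.loads(response)
        except ValueError:
            # Not valid JSON, so there is nothing to extract.
            return None
    if isinstance(response, dict):
        return response.get(key)
    return None
```

With this helper, the cell above could print extract_result(aks_return) without needing the KeyError handler.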

# Step 6: Clean up
Users may wish to delete the deployed endpoint once it is no longer needed.

In [None]:
service.delete()