# Deploy a question-answering model to the Triton Inference Server on NVIDIA Tesla V100s in Azure Kubernetes Service

This notebook shows you how to deploy a Bi-Directional Attention Flow (BiDAF) question-answering model to the high-performance [Triton Inference Server](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html) on graphics processing units (GPUs) in Azure Kubernetes Service (AKS).

Please note that this Public Preview release is subject to the [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/).

In [11]:
from azureml.core import Workspace

ws = Workspace.from_config()
ws

Workspace.create(name='default', subscription_id='6560575d-fa06-4e7d-95fb-f962e74efd7a', resource_group='azureml-examples')

## Preview steps

These steps are necessary only while this feature is in preview. A future release of the Azure Machine Learning Python SDK will make them unnecessary.

In [12]:
from azureml.core.model import Model

Model.Framework.MULTI = "Multi"
Model._SUPPORTED_FRAMEWORKS_FOR_NO_CODE_DEPLOY.append(Model.Framework.MULTI)

## Download model

Triton Inference Server can load your model only if the model follows the directory structure that Triton expects. [Read more about the directory structure that Triton expects](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/model_repository.html).
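As a rough illustration of that layout, the sketch below checks that each model directory contains at least one numerically named version directory holding some file. The function name `find_layout_problems`, the placeholder file name `model.onnx`, and the `broken_model` directory are hypothetical and chosen only for this example; the linked Triton documentation remains the authority on what the server accepts.

```python
import tempfile
from pathlib import Path


def find_layout_problems(repo_root):
    """Return a list of human-readable layout problems found under repo_root.

    Assumed layout (simplified): <repo_root>/<model_name>/<version>/<model file>,
    where <version> is a directory whose name is made of digits, such as "1".
    """
    problems = []
    model_dirs = [p for p in sorted(Path(repo_root).iterdir()) if p.is_dir()]
    if not model_dirs:
        problems.append(f"{repo_root}: no model directories found")
    for model_dir in model_dirs:
        versions = [
            p for p in sorted(model_dir.iterdir()) if p.is_dir() and p.name.isdigit()
        ]
        if not versions:
            problems.append(f"{model_dir.name}: no numeric version directory")
        for version_dir in versions:
            if not any(version_dir.iterdir()):
                problems.append(f"{model_dir.name}/{version_dir.name}: empty")
    return problems


# Build a throwaway repository: one well-formed model and one missing its version directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    good = root / "bidaf-9" / "1"
    good.mkdir(parents=True)
    (good / "model.onnx").touch()  # placeholder file, not a real model
    (root / "broken_model").mkdir()  # hypothetical model with no version directory
    problems = find_layout_problems(root)

print(problems)
```

Running a check like this against the downloaded `models/triton` folder before registering the model can surface a misplaced file early, before a deployment is attempted.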

In [3]:
import git
import sys
from pathlib import Path

# get the root of the repo
prefix = Path(git.Repo(".", search_parent_directories=True).working_tree_dir)

# Add the helper directory to sys.path so its functions can be imported as modules
path_to_insert = str(prefix.joinpath("code", "deployment", "triton"))
if path_to_insert not in sys.path:
    sys.path.insert(1, path_to_insert)

from model_utils import download_triton_models

download_triton_models(prefix)

successfully downloaded model: densenet_onnx
successfully downloaded model: bidaf-9
successfully downloaded model: keiji_model


## Register model

In [4]:
from azureml.core.model import Model

model_path = prefix.joinpath("models", "triton")

model = Model.register(
    model_path=model_path,
    model_name="bidaf-9-example",
    tags={"area": "Natural language processing", "type": "Question-answering"},
    description="Question answering from ONNX model zoo",
    workspace=ws,
    model_framework=Model.Framework.MULTI
)

model

Registering model bidaf-9-example


Model(workspace=Workspace.create(name='default', subscription_id='6560575d-fa06-4e7d-95fb-f962e74efd7a', resource_group='azureml-examples'), name=bidaf-9-example, id=bidaf-9-example:33, version=33, tags={'area': 'Natural language processing', 'type': 'Question-answering'}, properties={})

## Deploy webservice

Deploy to a pre-created [AksCompute](https://docs.microsoft.com/python/api/azureml-core/azureml.core.compute.aks.akscompute?view=azure-ml-py#provisioning-configuration-agent-count-none--vm-size-none--ssl-cname-none--ssl-cert-pem-file-none--ssl-key-pem-file-none--location-none--vnet-resourcegroup-name-none--vnet-name-none--subnet-name-none--service-cidr-none--dns-service-ip-none--docker-bridge-cidr-none--cluster-purpose-none--load-balancer-type-none-) named `aks-gpu-deploy`. For other options, see [our documentation](https://docs.microsoft.com/azure/machine-learning/how-to-deploy-and-where?tabs=azcli).


In [13]:
%%time

from azureml.core.webservice import AksWebservice
from azureml.core.model import InferenceConfig
from random import randint

service_name = "triton-bidaf-9" + str(randint(10000, 99999))

config = AksWebservice.deploy_configuration(
    compute_target_name="aks-gpu-deploy",
    gpu_cores=1,
    cpu_cores=1,
    memory_gb=4,
    auth_enabled=False,
)

service = Model.deploy(
    workspace=ws,
    name=service_name,
    models=[model],
    deployment_config=config,
    overwrite=True,
)

service.wait_for_deployment(show_output=True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running.
Failed


ERROR - Service deployment polling reached non-successful terminal state, current service state: Failed
Operation ID: aa821d2d-2dbd-4d6c-938d-d5e1a42d73cb
Current sub-operation type not known, more logs unavailable.
Error:
{
  "code": "BadRequest",
  "statusCode": 400,
  "message": "The request is invalid",
  "details": [
    {
      "code": "DeploymentResourceInsufficient",
      "message": "Unschedulable: CPU/Memory resource insufficient, 1 replicas unschedulable. Please add more node(s) with SKU more than 1.1 CPU Cores and 5.2GB Memory and 1 GPU Cores to host your replicas OR reduce your replica count and service resource requirements."
    }
  ]
}

WebserviceException: WebserviceException:
	Message: Service deployment polling reached non-successful terminal state, current service state: Failed
Operation ID: aa821d2d-2dbd-4d6c-938d-d5e1a42d73cb
Current sub-operation type not known, more logs unavailable.
Error:
{
  "code": "BadRequest",
  "statusCode": 400,
  "message": "The request is invalid",
  "details": [
    {
      "code": "DeploymentResourceInsufficient",
      "message": "Unschedulable: CPU/Memory resource insufficient, 1 replicas unschedulable. Please add more node(s) with SKU more than 1.1 CPU Cores and 5.2GB Memory and 1 GPU Cores to host your replicas OR reduce your replica count and service resource requirements."
    }
  ]
}
	InnerException None
	ErrorResponse 
{
    "error": {
        "message": "Service deployment polling reached non-successful terminal state, current service state: Failed\nOperation ID: aa821d2d-2dbd-4d6c-938d-d5e1a42d73cb\nCurrent sub-operation type not known, more logs unavailable.\nError:\n{\n  \"code\": \"BadRequest\",\n  \"statusCode\": 400,\n  \"message\": \"The request is invalid\",\n  \"details\": [\n    {\n      \"code\": \"DeploymentResourceInsufficient\",\n      \"message\": \"Unschedulable: CPU/Memory resource insufficient, 1 replicas unschedulable. Please add more node(s) with SKU more than 1.1 CPU Cores and 5.2GB Memory and 1 GPU Cores to host your replicas OR reduce your replica count and service resource requirements.\"\n    }\n  ]\n}"
    }
}
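The failure above is easier to act on once the `details` entries are pulled out of the surrounding log text. The following is a minimal sketch, assuming the error text contains a single JSON object shaped like the one shown; `deployment_error_details` is a hypothetical helper, not part of the Azure Machine Learning SDK.

```python
import json


def deployment_error_details(log_text):
    """Extract (code, message) pairs from the first JSON object in log_text."""
    start = log_text.find("{")
    if start == -1:
        return []
    # raw_decode parses one JSON value starting at `start` and ignores what follows it.
    payload, _ = json.JSONDecoder().raw_decode(log_text, start)
    return [
        (detail.get("code", ""), detail.get("message", ""))
        for detail in payload.get("details", [])
    ]


# Sample text assembled from the error output shown above.
sample_error = {
    "code": "BadRequest",
    "statusCode": 400,
    "message": "The request is invalid",
    "details": [
        {
            "code": "DeploymentResourceInsufficient",
            "message": (
                "Unschedulable: CPU/Memory resource insufficient, 1 replicas "
                "unschedulable. Please add more node(s) with SKU more than 1.1 CPU "
                "Cores and 5.2GB Memory and 1 GPU Cores to host your replicas OR "
                "reduce your replica count and service resource requirements."
            ),
        }
    ],
}
sample_log = (
    "Service deployment polling reached non-successful terminal state, "
    "current service state: Failed\nError:\n"
    + json.dumps(sample_error, indent=2)
    + "\n\tInnerException None"
)

for code, message in deployment_error_details(sample_log):
    print(f"{code}: {message}")
```

Here the `DeploymentResourceInsufficient` detail spells out the two remedies: add nodes large enough to host the replica, or reduce the replica count and the resources requested in the deployment configuration.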

In [22]:
print(service.get_logs())

Workspace.create(name='default', subscription_id='6560575d-fa06-4e7d-95fb-f962e74efd7a', resource_group='azureml-examples')

In [20]:
# ws.webservices maps service names to services; iterate over the names
for name in ws.webservices:
    if name.startswith("triton"):
        AksWebservice(ws, name).delete()

In [16]:
service = AksWebservice(ws, 'triton-bidaf-962220')
service.delete()

In [9]:
from utils import triton_init, get_model_info


triton_init(service.scoring_uri)

get_model_info()

Found model: bidaf-9, version: 1,               input meta: [{'name': 'query_char', 'datatype': 'BYTES', 'shape': [-1, 1, 1, 16]}, {'name': 'query_word', 'datatype': 'BYTES', 'shape': [-1, 1]}, {'name': 'context_word', 'datatype': 'BYTES', 'shape': [-1, 1]}, {'name': 'context_char', 'datatype': 'BYTES', 'shape': [-1, 1, 1, 16]}], input config: [{'name': 'query_char', 'data_type': 'TYPE_STRING', 'dims': ['-1', '1', '1', '16']}, {'name': 'query_word', 'data_type': 'TYPE_STRING', 'dims': ['-1', '1']}, {'name': 'context_word', 'data_type': 'TYPE_STRING', 'dims': ['-1', '1']}, {'name': 'context_char', 'data_type': 'TYPE_STRING', 'dims': ['-1', '1', '1', '16']}],               output_meta: [{'name': 'end_pos', 'datatype': 'INT32', 'shape': [1]}, {'name': 'start_pos', 'datatype': 'INT32', 'shape': [1]}], output config: [{'name': 'end_pos', 'data_type': 'TYPE_INT32', 'dims': ['1']}, {'name': 'start_pos', 'data_type': 'TYPE_INT32', 'dims': ['1']}]
Found model: densenet_onnx, version: 1,        
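The `bidaf-9` metadata above lists four string inputs: word tensors of shape `[-1, 1]` and character tensors of shape `[-1, 1, 1, 16]`, with one row per token. The sketch below shows how text might be arranged into those shapes using plain nested lists. It assumes preprocessing along the lines of the ONNX model zoo's BiDAF example (lower-cased words, characters truncated or padded to 16 per token) and uses a simple regular-expression tokenizer as a stand-in for a real one. The function names are hypothetical, and the `bidaf_utils` helper used in this notebook may work differently.

```python
import re

CHARS_PER_TOKEN = 16  # last dimension of query_char / context_char in the metadata


def tokenize(text):
    """Split text into word and punctuation tokens (stand-in tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text)


def to_model_inputs(text):
    """Return (words, chars) as nested lists shaped [n, 1] and [n, 1, 1, 16]."""
    tokens = tokenize(text)
    words = [[token.lower()] for token in tokens]
    chars = []
    for token in tokens:
        letters = list(token)[:CHARS_PER_TOKEN]
        # Pad short tokens with empty strings so every row has exactly 16 entries.
        letters += [""] * (CHARS_PER_TOKEN - len(letters))
        chars.append([[letters]])
    return words, chars


query_word, query_char = to_model_inputs("Which animal was lower?")
print(len(query_word), "tokens")
print(query_word[0], query_char[0][0][0][:6])
```

The leading `-1` in each shape is the variable token count, which is why the query and the context can have different lengths.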

## Test the webservice

In [None]:
!pip install --upgrade nltk geventhttpclient python-rapidjson

In [10]:
import json

# Uses a modified version of tritonhttpclient during Preview; a PR is out for review:
# https://github.com/triton-inference-server/server/pull/2047

from bidaf_utils import init, run

init(service.scoring_uri)

data = [
    "A quick brown fox jumped over the lazy dog.",
    "Which animal was lower?",
]
run(json.dumps(data))

[nltk_data] Downloading package punkt to /home/azureuser/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Found model: bidaf-9, version: 1,               input meta: [{'name': 'query_char', 'datatype': 'BYTES', 'shape': [-1, 1, 1, 16]}, {'name': 'query_word', 'datatype': 'BYTES', 'shape': [-1, 1]}, {'name': 'context_word', 'datatype': 'BYTES', 'shape': [-1, 1]}, {'name': 'context_char', 'datatype': 'BYTES', 'shape': [-1, 1, 1, 16]}], input config: [{'name': 'query_char', 'data_type': 'TYPE_STRING', 'dims': ['-1', '1', '1', '16']}, {'name': 'query_word', 'data_type': 'TYPE_STRING', 'dims': ['-1', '1']}, {'name': 'context_word', 'data_type': 'TYPE_STRING', 'dims': ['-1', '1']}, {'name': 'context_char', 'data_type': 'TYPE_STRING', 'dims': ['-1', '1', '1', '16']}],               output_meta: [{'name': 'end_pos', 'datatype': 'INT32', 'shape': [1]}, {'name': 'start_pos', 'datatype': 'INT32', 'shape': [1]}], output config: [{'name': 'end_pos', 'data_type': 'TYPE_INT32', 'dims': ['1']}, {'name': 'start_pos', 'data_type': 'TYPE_INT32', 'dims': ['1']}]
Found model: densenet_onnx, version: 1,        

[b'lazy', b'dog']
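The model's two outputs, `start_pos` and `end_pos`, are token indices into the context, so the answer is the inclusive slice of context tokens between them. The sketch below illustrates that mapping with a simple regular-expression tokenizer as a stand-in. The indices 7 and 8 are chosen here to reproduce the output above and were not read from the service, and `extract_answer` is a hypothetical helper rather than a function from `bidaf_utils`.

```python
import re


def tokenize(text):
    """Split text into word and punctuation tokens (stand-in tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text)


def extract_answer(context, start_pos, end_pos):
    """Return the lower-cased context tokens from start_pos to end_pos, inclusive."""
    tokens = [token.lower() for token in tokenize(context)]
    return [token.encode() for token in tokens[start_pos : end_pos + 1]]


context = "A quick brown fox jumped over the lazy dog."
print(extract_answer(context, start_pos=7, end_pos=8))
```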

## Delete the webservice and the downloaded model

In [None]:
from model_utils import delete_triton_models

service.delete()
delete_triton_models(prefix)

## Next steps

Read [our documentation](https://aka.ms/triton-aml-docs) to learn how to use Triton with your own models, or check out the other notebooks in this folder for ways to do pre- and post-processing on the server.