In [9]:
try:
    # prefer parent path so notebook can be run from this folder
    %pip install -r ../../../requirements.txt
except Exception:
    # fallback to local path
    %pip install -r requirements.txt

print("Installation command executed. Restart kernel if necessary.")

Note: you may need to restart the kernel to use updated packages.
Installation command executed. Restart kernel if necessary.


# Deploy a TensorFlow model served with TF Serving using a custom container in an online endpoint
Learn how to deploy a custom container as an online endpoint in Azure Machine Learning.

Custom container deployments can use web servers other than the default Python Flask server used by Azure Machine Learning. Users of these deployments can still take advantage of Azure Machine Learning's built-in monitoring, scaling, alerting, and authentication.

## Prerequisites

* To use Azure Machine Learning, you must have an Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the [free or paid version of Azure Machine Learning](https://azure.microsoft.com/free/).

* Install and configure the [Python SDK v2](sdk/setup.sh).

* You must have an Azure resource group, and you (or the service principal you use) must have Contributor access to it.

* You must have an Azure Machine Learning workspace. 

* To deploy locally, you must install [Docker Engine](https://docs.docker.com/engine/install/) on your local computer. We highly recommend this option because it makes issues easier to debug.

# 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the endpoint will be deployed.

## 1.1. Import the required libraries

In [10]:
# import required libraries
from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    Model,
    Environment,
    CodeConfiguration,
)
from azure.identity import DefaultAzureCredential

## 1.2. Configure workspace details and get a handle to the workspace

To connect to a workspace, we need three identifiers: a subscription ID, a resource group, and a workspace name. We will pass these details to the `MLClient` from `azure.ai.ml` to get a handle to the required Azure Machine Learning workspace. We use [default Azure authentication](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential?view=azure-python) for this tutorial. Check the [configuration notebook](../../jobs/configuration.ipynb) for more details on how to configure credentials and connect to a workspace.

In [11]:
# Try to load workspace details from a local .env file (at notebooks/.env)
import os
from pathlib import Path

# Relative path from this notebook to the notebooks/.env file
env_path = Path("../../../.env")

if env_path.exists():
    with env_path.open() as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            if "=" in line:
                k, v = line.split("=", 1)
                os.environ[k.strip()] = v.strip()

else:
    print(f".env file not found at {env_path}. Falling back to variables already set in the environment.")

# Read expected variables outside the branch so they are always defined
# (each one is an empty string when it is not set)
subscription_id = os.environ.get("SUBSCRIPTION_ID", "")
resource_group = os.environ.get("RESOURCE_GROUP", "")
workspace = os.environ.get("WORKSPACE_NAME", "")
appinsights_conn = os.environ.get("APPINSIGHTS_CONNECTION_STRING", "")

print("Loaded workspace configuration:")
print("  SUBSCRIPTION_ID=", "")
print("  RESOURCE_GROUP=", "")
print("  WORKSPACE_NAME=", "")

Loaded workspace configuration:
  SUBSCRIPTION_ID= 
  RESOURCE_GROUP= 
  WORKSPACE_NAME= 


In [12]:
# get a handle to the workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

Overriding of current TracerProvider is not allowed
Overriding of current LoggerProvider is not allowed
Overriding of current MeterProvider is not allowed
Attempting to instrument while already instrumented
Attempting to instrument while already instrumented


# 2. Download a TensorFlow model

BASE_PATH=endpoints/online/custom-container

AML_MODEL_NAME=tfserving-mounted

MODEL_NAME=half_plus_two

MODEL_BASE_PATH=/var/azureml-app/azureml-models/$AML_MODEL_NAME/1

Download and unzip a model that divides an input by two and adds 2 to the result:

`wget https://aka.ms/half_plus_two-model -O $BASE_PATH/half_plus_two.tar.gz`

`tar -xvf $BASE_PATH/half_plus_two.tar.gz -C $BASE_PATH`

In this sample, the model has already been downloaded.
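For orientation, the arithmetic described above can be written in plain Python. This is only an illustrative sketch of the computation, not the SavedModel itself; the function name `half_plus_two` is a stand-in chosen here to match the model name.

```python
def half_plus_two(values):
    """Return x / 2 + 2 for each input, mirroring the model's described behavior."""
    return [x / 2 + 2 for x in values]


if __name__ == "__main__":
    # Example inputs chosen for illustration
    print(half_plus_two([1.0, 2.0, 5.0]))
```

Comparing the deployed model's predictions against this function is a quick sanity check that the right model is being served.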

In [14]:
# Prepare TensorFlow Serving directory structure and register the model
from pathlib import Path
import shutil
from azure.ai.ml.entities import Model

original_model_dir = Path("half_plus_two")
serving_root = Path("tfservingcustom_serving")
version_subdir = serving_root / "half_plus_two" / "1"

if serving_root.exists():
    shutil.rmtree(serving_root)
version_subdir.mkdir(parents=True, exist_ok=True)

for item in original_model_dir.iterdir():
    destination = version_subdir / item.name
    if item.is_dir():
        shutil.copytree(item, destination)
    else:
        shutil.copy2(item, destination)

registered_model = ml_client.models.create_or_update(
    Model(
        name="tfservingcustom",
        path=str(serving_root),
        type="custom_model",
        description="TensorFlow SavedModel packaged for TF Serving",
        tags={"framework": "tensorflow", "format": "saved_model"},
    )
)

print(f"Registered model: {registered_model.name} (version {registered_model.version})")

Registered model: tfservingcustom (version 1)


# 3. Test locally
## 3.1 Use docker to run your image locally for testing
Start the TF Serving image in a detached container, mounting the model directory and publishing the REST port 8501:

`docker run --rm -d -v $PWD/$BASE_PATH:$MODEL_BASE_PATH -p 8501:8501 \
 -e MODEL_BASE_PATH=$MODEL_BASE_PATH -e MODEL_NAME=$MODEL_NAME \
 --name="tfserving-test" docker.io/tensorflow/serving:latest
sleep 10`

## 3.2 Check that you can send liveness and scoring requests to the image
First, check that the container is "alive," meaning that the process inside the container is still running. You should get a 200 (OK) response.

`curl -v http://localhost:8501/v1/models/$MODEL_NAME`

## 3.3 Check that you can get predictions about unlabeled data
`curl --header "Content-Type: application/json" \
  --request POST \
  --data @$BASE_PATH/sample_request.json \
  http://localhost:8501/v1/models/$MODEL_NAME:predict`
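The same scoring request can be built from Python with only the standard library. The sketch below constructs the request without sending it, assuming the container is reachable on `localhost:8501` as in the commands above. The helper name `build_predict_request` is hypothetical, and the payload `{"instances": [1.0, 2.0, 5.0]}` is an assumption about what the sample request file contains.

```python
import json
import urllib.request


def build_predict_request(model_name, instances, host="localhost", port=8501):
    """Build (but do not send) a TF Serving REST predict request."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed sample payload; adjust to match your request file
request = build_predict_request("half_plus_two", [1.0, 2.0, 5.0])
print(request.full_url)

# With the container from 3.1 running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```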

## 3.4 Stop the image
Now that you've tested locally, stop the container:

`docker stop tfserving-test`

# 4. Deploy your online endpoint to Azure
Next, deploy your online endpoint to Azure.

## 4.1 Configure online endpoint
`endpoint_name`: The name of the endpoint. It must be unique in the Azure region. Naming rules are defined under [managed online endpoint limits](https://docs.microsoft.com/azure/machine-learning/how-to-manage-quotas#azure-machine-learning-managed-online-endpoints-preview).

`auth_mode` : Use `key` for key-based authentication. Use `aml_token` for Azure Machine Learning token-based authentication. A `key` does not expire, but `aml_token` does expire. 

Optionally, you can add a description and tags to your endpoint.
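Whichever `auth_mode` is chosen, a client that calls the scoring URI directly sends the key or token as a bearer credential. The following is a minimal sketch of the request headers such a client might build. The helper name `build_scoring_headers` and the credential value are placeholders, and the optional `azureml-model-deployment` header is included on the assumption that you want to route a request to one specific deployment.

```python
def build_scoring_headers(credential, deployment_name=None):
    """Return HTTP headers for a direct call to an online endpoint's scoring URI.

    `credential` is either an endpoint key or an Azure Machine Learning token.
    """
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {credential}",
    }
    if deployment_name:
        # Route the request to a specific deployment instead of using traffic rules
        headers["azureml-model-deployment"] = deployment_name
    return headers


if __name__ == "__main__":
    # "<endpoint-key>" is a placeholder, not a real credential
    print(build_scoring_headers("<endpoint-key>", deployment_name="orange"))
```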

In [15]:
# Creating a unique endpoint name with current datetime to avoid conflicts
import datetime

online_endpoint_name = "endpoint-gpu" + datetime.datetime.now().strftime("%m%d%H%M%f")

# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="this is a sample online endpoint",
    auth_mode="key",
    tags={"foo": "bar"},
)

## 4.2 Create the endpoint
Using the `MLClient` created earlier, we will now create the endpoint in the workspace. `begin_create_or_update` starts the endpoint creation and returns a poller; calling `.result()` on it blocks until the creation has finished.

In [16]:
ml_client.begin_create_or_update(endpoint).result()

ManagedOnlineEndpoint({'public_network_access': 'Enabled', 'provisioning_state': 'Succeeded', 'scoring_uri': 'https://endpoint-gpu10090133451811.canadacentral.inference.ml.azure.com/score', 'openapi_uri': 'https://endpoint-gpu10090133451811.canadacentral.inference.ml.azure.com/swagger.json', 'name': 'endpoint-gpu10090133451811', 'description': 'this is a sample online endpoint', 'tags': {'foo': 'bar'}, 'properties': {'createdBy': 'System Administrator', 'createdAt': '2025-10-09T01:33:47.114993+0000', 'lastModifiedAt': '2025-10-09T01:33:47.114993+0000', 'azureml.onlineendpointid': '/subscriptions/5784b6a5-de3f-4fa4-8b8f-e5bb70ff6b25/resourcegroups/rg-aml-ws-prod-cc-01/providers/microsoft.machinelearningservices/workspaces/mlwprodcc01/onlineendpoints/endpoint-gpu10090133451811', 'AzureAsyncOperationUri': 'https://management.azure.com/subscriptions/5784b6a5-de3f-4fa4-8b8f-e5bb70ff6b25/providers/Microsoft.MachineLearningServices/locations/canadacentral/mfeOperationsStatus/oeidp:fb55c029-0b

## 4.3 Configure online deployment
A deployment is a set of resources required for hosting the model that does the actual inferencing. We will create a deployment for our endpoint using the `ManagedOnlineDeployment` class.

### Key aspects of deployment 
- `name` - Name of the deployment.
- `endpoint_name` - Name of the endpoint to create the deployment under.
- `model` - The model to use for the deployment. This value can be either a reference to an existing versioned model in the workspace or an inline model specification.
- `environment` - The environment to use for the deployment. This value can be either a reference to an existing versioned environment in the workspace or an inline environment specification.
- `code_configuration` - the configuration for the source code and scoring script
    - `path`- Path to the source code directory for scoring the model
    - `scoring_script` - Relative path to the scoring file in the source code directory
- `instance_type` - The VM size to use for the deployment. For the list of supported sizes, see [Managed online endpoints SKU list](https://docs.microsoft.com/en-us/azure/machine-learning/reference-managed-online-endpoints-vm-sku-list).
- `instance_count` - The number of instances to use for the deployment

In [None]:
# create a deployment backed by the tfservingcustom model
# (Standard_F2s_v2, used below, is a CPU instance type)
try:
    model = ml_client.models.get(name="tfservingcustom", version=str(registered_model.version))
    model_version = registered_model.version
except NameError:
    latest_model = list(ml_client.models.list(name="tfservingcustom"))
    if not latest_model:
        raise RuntimeError("No registered model named tfservingcustom found. Run step 2 first.")
    latest_model = sorted(latest_model, key=lambda m: int(m.version), reverse=True)[0]
    model = latest_model
    model_version = latest_model.version

model_mount_base = f"/var/azureml-app/azureml-models/{model.name}/{model_version}"

env = Environment(
    image="docker.io/tensorflow/serving:latest",
    inference_config={
        "liveness_route": {"port": 8501, "path": "/v1/models/half_plus_two"},
        "readiness_route": {"port": 8501, "path": "/v1/models/half_plus_two"},
        "scoring_route": {"port": 8501, "path": "/v1/models/half_plus_two:predict"},
    },
)

environment_variables = {
    "MODEL_BASE_PATH": model_mount_base,
    "MODEL_NAME": "half_plus_two",
}

print("Using model:", model.name, "version", model_version)
print("MODEL_BASE_PATH:", environment_variables["MODEL_BASE_PATH"])

blue_deployment = ManagedOnlineDeployment(
    name="orange",
    endpoint_name=online_endpoint_name,
    model=model,
    environment=env,
    environment_variables=environment_variables,
    instance_type="Standard_F2s_v2",
    instance_count=1,
)

In [None]:
# Inspect packaged model structure before deployment
from pathlib import Path

print("Packaged model directory structure:")
for path in sorted(Path("tfservingcustom_serving").rglob("*")):
    rel = path.relative_to("tfservingcustom_serving")
    if path.is_dir():
        print(f"  [DIR] {rel}")
    else:
        print(f"  [FILE] {rel}")
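The directory listing can also be checked programmatically. TF Serving looks for numeric version directories under `<MODEL_BASE_PATH>/<MODEL_NAME>`, each holding a SavedModel. The sketch below reports which versions it would find, assuming the binary `saved_model.pb` format; the helper name `find_servable_versions` is hypothetical.

```python
from pathlib import Path


def find_servable_versions(base_path, model_name):
    """Return the numeric version directories that contain a saved_model.pb file."""
    model_dir = Path(base_path) / model_name
    if not model_dir.is_dir():
        return []
    versions = [
        int(child.name)
        for child in model_dir.iterdir()
        if child.is_dir()
        and child.name.isdigit()
        and (child / "saved_model.pb").is_file()
    ]
    return sorted(versions)


if __name__ == "__main__":
    # An empty list means TF Serving would find nothing to load at this path
    print(find_servable_versions("tfservingcustom_serving", "half_plus_two"))
```

An empty result is a hint that `MODEL_BASE_PATH` or the copied directory layout does not match what the server expects.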

### Readiness route vs. liveness route
An HTTP server defines paths for both liveness and readiness. A liveness route is used to check whether the server is running. A readiness route is used to check whether the server is ready to do work. In machine learning inference, a server could respond 200 OK to a liveness request before loading a model. The server could respond 200 OK to a readiness request only after the model has been loaded into memory.

Review the [Kubernetes documentation](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) for more information about liveness and readiness probes.

Notice that this deployment uses the same path for both liveness and readiness, since TF Serving only defines a liveness route.
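To make the distinction concrete, here is a small sketch of how a server that does define both routes might decide its responses. The class `ProbeState` is purely illustrative and is not part of TF Serving or Azure Machine Learning; returning 503 while the model is still loading is a common convention assumed here.

```python
class ProbeState:
    """Tracks whether a hypothetical inference server has finished loading its model."""

    def __init__(self):
        self.model_loaded = False

    def load_model(self):
        # Stand-in for the (possibly slow) step of loading a model into memory
        self.model_loaded = True

    def liveness_status(self):
        # The process is running, so liveness succeeds even before the model loads
        return 200

    def readiness_status(self):
        # Readiness succeeds only once the model can actually serve requests
        return 200 if self.model_loaded else 503


if __name__ == "__main__":
    state = ProbeState()
    print(state.liveness_status(), state.readiness_status())
    state.load_model()
    print(state.liveness_status(), state.readiness_status())
```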

## 4.4 Create the deployment
Using the `MLClient` created earlier, we will now create the deployment in the workspace. `begin_create_or_update` starts the deployment creation and returns a poller; calling `.result()` on it blocks until the deployment has finished or failed.

In [None]:
ml_client.begin_create_or_update(blue_deployment).result()

Check: endpoint endpoint-gpu10090056098905 exists


...............................................................................................................................................................................

HttpResponseError: (ResourceNotReady) User container has crashed or terminated: Liveness probe failed: Get . Please see troubleshooting guide, available here: https://aka.ms/oe-tsg#error-resourcenotready
Code: ResourceNotReady
Message: User container has crashed or terminated: Liveness probe failed: Get . Please see troubleshooting guide, available here: https://aka.ms/oe-tsg#error-resourcenotready

In [None]:
# the "orange" deployment created above takes 100% of the traffic
endpoint.traffic = {"orange": 100}
ml_client.begin_create_or_update(endpoint).result()

#endpoint.traffic = {"blue": 0, "green": 0, "orange": 100}
#ml_client.begin_create_or_update(endpoint).result()
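A traffic map only makes sense if every key names a deployment that exists on the endpoint. The sketch below validates a map locally before it is sent. The helper `validate_traffic` is hypothetical, and the rule that percentages must total 0 or 100 is an assumption about what the service accepts.

```python
def validate_traffic(traffic, deployment_names):
    """Check a traffic map against the known deployment names before applying it."""
    unknown = sorted(set(traffic) - set(deployment_names))
    if unknown:
        raise ValueError(f"Traffic references unknown deployments: {unknown}")
    total = sum(traffic.values())
    # Assumed rule: traffic percentages must add up to 0 or 100
    if total not in (0, 100):
        raise ValueError(f"Traffic percentages sum to {total}, expected 0 or 100")
    return traffic


if __name__ == "__main__":
    # The deployment in this notebook is created with name="orange"
    print(validate_traffic({"orange": 100}, ["orange"]))
```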

# 5. Test the endpoint with sample data
Using the `MLClient` created earlier, we will send a request to the endpoint. The endpoint can be invoked using the `invoke` method with the following parameters:
- `endpoint_name` - Name of the endpoint
- `request_file` - File with request data
- `deployment_name` - Name of the specific deployment to test in an endpoint

We will send a sample request using a [json](./model-1/sample-request.json) file. 

In [None]:
# test the "orange" deployment with some sample data
ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    deployment_name="orange",
    request_file="sample-request.json",
)

'{\n    "predictions": [2.5, 3.0, 4.5\n    ]\n}'
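The value shown above is a JSON string, so it needs to be parsed before the predictions can be used as numbers. A minimal sketch, with the response text copied from the output above into a variable named `raw_response` for illustration:

```python
import json

# Response text copied from the cell output above
raw_response = '{\n    "predictions": [2.5, 3.0, 4.5\n    ]\n}'

predictions = json.loads(raw_response)["predictions"]
print(predictions)  # [2.5, 3.0, 4.5]
```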

# 6. Managing endpoints and deployments

## 6.1 Get details of the endpoint

In [None]:
# Get the details for online endpoint
endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)

# existing traffic details
print(endpoint.traffic)

# Get the scoring URI
print(endpoint.scoring_uri)

## 6.2 Get the logs for the new deployment
Get the logs for the `orange` deployment and verify as needed.

In [None]:
ml_client.online_deployments.get_logs(
    name="orange", endpoint_name=online_endpoint_name, lines=50
)

# 7. Delete the endpoint

In [None]:
#ml_client.online_endpoints.begin_delete(name=online_endpoint_name)