![image](https://raw.githubusercontent.com/IBM/watson-machine-learning-samples/master/cloud/notebooks/headers/watsonx-Prompt_Lab-Notebook.png)
# Use watsonx and Model Gateway to run an AI service with load balancing

#### Disclaimers

- Use only Projects and Spaces that are available in the watsonx context.


## Notebook content

This notebook demonstrates, step by step, the code required to work with watsonx.ai Model Gateway.

Some familiarity with Python is helpful. This notebook uses Python 3.11.


## Learning goal

The goal of this notebook is to show how to use Model Gateway to create AI services that call a model served by an OpenAI-compatible provider. You will also learn how to load balance between models inside an AI service.

## Table of Contents

This notebook contains the following parts:

- [Setup](#setup)
- [Initialize and configure Model Gateway](#gateway-configuration)
- [Create model and deploy it as AI service](#create-model-ai-service)
- [Create models and deploy them as an AI service with load balancing](#create-models-ai-service-load-balancing)
- [Summary](#summary)

<a id="setup"></a>
## Set up the environment

Before you use the sample code in this notebook, you must perform the following setup tasks:

- Create a <a href="https://cloud.ibm.com/catalog/services/watsonxai-runtime" target="_blank" rel="noopener no referrer">watsonx.ai Runtime Service</a> instance (a free plan is offered and information about how to create the instance can be found <a href="https://dataplatform.cloud.ibm.com/docs/content/wsj/getting-started/wml-plans.html?context=wx&audience=wdp" target="_blank" rel="noopener no referrer">here</a>).

**Note:** The model load balancing example in this notebook may raise `Status Code 429 (Too Many Requests)` errors on the free plan, because that plan allows fewer requests per second.
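If you run into these errors, one common mitigation is to retry with exponential backoff. The sketch below is a generic illustration and not part of the `ibm-watsonx-ai` SDK: `TooManyRequestsError` and `call_with_backoff` are hypothetical names, and you would need to map the exception your client actually raises onto the retry condition yourself.

```python
import time


class TooManyRequestsError(Exception):
    """Hypothetical stand-in for an HTTP 429 error raised by a client."""


def call_with_backoff(func, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call func(); on a 429-style error wait base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except TooManyRequestsError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2**attempt)


# Usage with a fake call that fails twice before succeeding.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TooManyRequestsError("429 Too Many Requests")
    return "ok"


waits = []  # record the requested delays instead of really sleeping
result = call_with_backoff(flaky_call, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the helper easy to test: the usage above records the delays rather than pausing.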

### Install dependencies
**Note:** `ibm-watsonx-ai` documentation can be found <a href="https://ibm.github.io/watsonx-ai-python-sdk/index.html" target="_blank" rel="noopener no referrer">here</a>.

In [1]:
%pip install -U "ibm_watsonx_ai>=1.3.40" | tail -n 1

[1A[2KSuccessfully installed anyio-4.10.0 cachetools-6.2.0 certifi-2025.8.3 charset_normalizer-3.4.3 h11-0.16.0 httpcore-1.0.9 httpx-0.28.1 ibm-cos-sdk-2.14.3 ibm-cos-sdk-core-2.14.3 ibm-cos-sdk-s3transfer-2.14.3 ibm_watsonx_ai-1.3.40 idna-3.10 jmespath-1.0.1 lomond-0.3.3 numpy-2.3.3 pandas-2.2.3 pytz-2025.2 requests-2.32.5 sniffio-1.3.1 tabulate-0.9.0 tzdata-2025.2 urllib3-2.5.0


### Define the watsonx.ai credentials
Use the code cell below to define the watsonx.ai credentials that are required to work with watsonx Foundation Model inferencing.

**Action:** Provide the IBM Cloud user API key. For details, see <a href="https://cloud.ibm.com/docs/account?topic=account-userapikey&interface=ui" target="_blank" rel="noopener no referrer">Managing user API keys</a>.

In [2]:
import getpass
from ibm_watsonx_ai import Credentials

credentials = Credentials(
    url="https://ca-tor.ml.cloud.ibm.com",
    api_key=getpass.getpass("Enter your watsonx.ai api key and hit enter: "),
)

### Working with spaces

You need a space to work in. If you do not have one, you can create it in the [Deployment Spaces Dashboard](https://dataplatform.cloud.ibm.com/ml-runtime/spaces?context=wx).

- Click **New Deployment Space**
- Create an empty space
- Select Cloud Object Storage
- Select watsonx.ai Runtime instance and press **Create**
- Go to **Manage** tab
- Copy `Space GUID` and paste it below

**Tip**: You can also use the SDK to prepare the space for your work. More information can be found [here](https://github.com/IBM/watson-machine-learning-samples/blob/master/cloud/notebooks/python_sdk/instance-management/Space%20management.ipynb).

**Action**: Assign the space ID below.

In [3]:
import os

try:
    space_id = os.environ["SPACE_ID"]
except KeyError:
    space_id = input("Please enter your space_id (hit enter): ")

### Create `APIClient` instance

In [4]:
from ibm_watsonx_ai import APIClient

client = APIClient(credentials=credentials, space_id=space_id)

<a id="gateway-configuration"></a>
## Initialize and configure Model Gateway
In this section we will initialize the Model Gateway and configure its providers.

### Initialize the Model Gateway
Create `Gateway` instance

In [5]:
from ibm_watsonx_ai.gateway import Gateway

gateway = Gateway(api_client=client)

List available providers

In [6]:
gateway.providers.list()

Unnamed: 0,ID,NAME,TYPE


### Create secret instance in IBM Cloud Secrets Manager
When creating a model provider, you need to supply your credentials. This is achieved by creating a key-value secret in IBM Cloud Secrets Manager and providing its CRN in the provider creation request payload.

The exact specification of the secret content depends on the provider type. For more information, please see the [documentation](https://www.ibm.com/docs/en/watsonx/saas?topic=preview-setting-up-model-gateway). For watsonx.ai provider, the content should contain the following key-value pairs:

```json
{
    "apikey": "<YOUR_API_KEY>",
    "auth_url": "https://iam.cloud.ibm.com/identity/token",
    "base_url": "https://ca-tor.ml.cloud.ibm.com",
    "space_id": "<YOUR_SPACE_ID>",
    "project_id": "<YOUR_PROJECT_ID>"
}
```

`base_url` can point to a different location. Provide at least one of `space_id` and `project_id`: each is required only if the other is omitted.
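If you prefer to assemble the secret content programmatically before pasting it into Secrets Manager, a small helper can enforce the rule that `space_id` or `project_id` must be present. This is a minimal sketch: `build_watsonx_secret` is a hypothetical helper, not an SDK function, and it only builds and prints the JSON without creating the secret.

```python
import json


def build_watsonx_secret(
    apikey,
    base_url,
    space_id=None,
    project_id=None,
    auth_url="https://iam.cloud.ibm.com/identity/token",
):
    """Return the key-value payload for a watsonx.ai provider secret."""
    if not (space_id or project_id):
        raise ValueError("Provide at least one of space_id or project_id")
    secret = {"apikey": apikey, "auth_url": auth_url, "base_url": base_url}
    # Only include the identifiers that were actually supplied.
    if space_id:
        secret["space_id"] = space_id
    if project_id:
        secret["project_id"] = project_id
    return secret


secret = build_watsonx_secret(
    apikey="<YOUR_API_KEY>",
    base_url="https://ca-tor.ml.cloud.ibm.com",
    space_id="<YOUR_SPACE_ID>",
)
print(json.dumps(secret, indent=4))
```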

In [7]:
secret_crn_id = "PASTE_YOUR_SECRET_CRN_HERE"

### Work with watsonx.ai provider

Create provider

In [8]:
watsonx_ai_provider_details = gateway.providers.create(
    provider="watsonxai", name="watsonx-ai-provider", secret_crn_id=secret_crn_id
)

watsonx_ai_provider_id = gateway.providers.get_id(watsonx_ai_provider_details)
watsonx_ai_provider_id

'c18a1610-b2fa-47bb-a986-02bceff805d8'

Get provider details

In [9]:
gateway.providers.get_details(watsonx_ai_provider_id)

List available models for created provider

In [10]:
gateway.providers.list_available_models(watsonx_ai_provider_id)

Unnamed: 0,MODEL_ID,TYPE
0,ibm/granite-3-8b-instruct,IBM
1,ibm/granite-embedding-107m-multilingual,IBM
2,ibm/granite-embedding-278m-multilingual,IBM
3,ibm/slate-125m-english-rtrvr-v2,IBM
4,ibm/slate-30m-english-rtrvr-v2,IBM
5,intfloat/multilingual-e5-large,intfloat
6,meta-llama/llama-3-2-11b-vision-instruct,Meta:Hugging Face
7,meta-llama/llama-3-3-70b-instruct,Meta:Hugging Face
8,mistralai/mistral-large,Mistral AI:Mistral
9,mistralai/pixtral-12b,Mistral AI:Hugging Face


<a id="create-model-ai-service"></a>

## Create model and deploy it as AI service
In this section we will create a model using Model Gateway and deploy it as an AI service.

### Create model using Model Gateway
In this sample we will use the `ibm/granite-3-8b-instruct` model.

In [11]:
model = "ibm/granite-3-8b-instruct"

model_details = gateway.models.create(
    provider_id=watsonx_ai_provider_id,
    model=model,
)

model_id = gateway.models.get_id(model_details)

In [12]:
gateway.models.list()

Unnamed: 0,ID,MODEL,CREATED,TYPE
0,65ce9f6c-0aa6-4052-b84e-87b59a59586b,ibm/granite-3-8b-instruct,2025-09-22 12:46:13,watsonxai:watsonx-ai-provider


### Create custom software specification containing a custom version of `ibm-watsonx-ai` SDK

Define `requirements.txt` file for package extension

In [13]:
requirements_txt = "ibm_watsonx_ai>=1.3.40"

with open("requirements.txt", "w") as file:
    file.write(requirements_txt)

Get the ID of base software specification

In [14]:
base_software_specification_id = client.software_specifications.get_id_by_name(
    "runtime-24.1-py3.11"
)

Store the package extension

In [15]:
meta_props = {
    client.package_extensions.ConfigurationMetaNames.NAME: "Model Gateway extension",
    client.package_extensions.ConfigurationMetaNames.DESCRIPTION: "Package extension with Model Gateway functionality enabled in ibm-watsonx-ai",
    client.package_extensions.ConfigurationMetaNames.TYPE: "requirements_txt",
}

package_extension_details = client.package_extensions.store(
    meta_props, file_path="requirements.txt"
)
package_extension_id = client.package_extensions.get_id(package_extension_details)

Creating package extension
SUCCESS


Create a new software specification with the created package extension

In [16]:
meta_props = {
    client.software_specifications.ConfigurationMetaNames.NAME: "Model Gateway software specification",
    client.software_specifications.ConfigurationMetaNames.DESCRIPTION: "Software specification for Model Gateway",
    client.software_specifications.ConfigurationMetaNames.BASE_SOFTWARE_SPECIFICATION: {
        "guid": base_software_specification_id
    },
}

software_specification_details = client.software_specifications.store(meta_props)
software_specification_id = client.software_specifications.get_id(
    software_specification_details
)

client.software_specifications.add_package_extension(
    software_specification_id, package_extension_id
)

SUCCESS


'SUCCESS'

### Create AI service

Prepare the function that will be deployed as an AI service.

In [17]:
def deployable_ai_service(context, url=credentials.url, model_id=model, **kwargs): # fmt: skip
    from ibm_watsonx_ai import APIClient, Credentials
    from ibm_watsonx_ai.gateway import Gateway

    api_client = APIClient(
        credentials=Credentials(url=url, token=context.generate_token()),
        space_id=context.get_space_id(),
    )

    gateway = Gateway(api_client=api_client)

    def generate(context) -> dict:
        api_client.set_token(context.get_token())

        payload = context.get_json()
        prompt = payload["prompt"]

        messages = [
            {
                "role": "user",
                "content": prompt,
            }
        ]

        response = gateway.chat.completions.create(model=model_id, messages=messages)

        return {"body": response}

    return generate
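The function above follows a two-phase pattern: the outer function runs once and sets up the client, while the inner `generate` function handles each request. The sketch below mimics that structure without any network calls. `FakeContext` and `echo_service` are hypothetical stand-ins written for this illustration; only the `get_json` method name mirrors the context method used above.

```python
class FakeContext:
    """Hypothetical stand-in for the runtime context; holds one JSON payload."""

    def __init__(self, payload=None):
        self.payload = payload or {}

    def get_json(self):
        return self.payload


setup_calls = []


def echo_service(context, prefix="echo: ", **kwargs):
    # Outer function: runs once, when the service is created.
    setup_calls.append("setup")

    def generate(context) -> dict:
        # Inner function: runs for every request and reads that request's payload.
        prompt = context.get_json()["prompt"]
        return {"body": prefix + prompt}

    return generate


handler = echo_service(FakeContext())
first = handler(FakeContext({"prompt": "What is a tram?"}))
second = handler(FakeContext({"prompt": "What is a bus?"}))
print(first, second, len(setup_calls))
```

Keeping expensive setup in the outer function means it is not repeated for every request, which is why the client and `Gateway` objects are created there.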

### Testing AI service's function locally

Create AI service function

In [18]:
from ibm_watsonx_ai.deployments import RuntimeContext

context = RuntimeContext(api_client=client)
local_function = deployable_ai_service(context=context)

Prepare request payload

In [19]:
context.request_payload_json = {"prompt": "What is a tram?"}

Execute the function locally

In [20]:
import json

response = local_function(context)
print(json.dumps(response, indent=2))

{
  "body": {
    "id": "chatcmpl-02cb6edd-f726-478e-83dc-cd349a0e165c---163b0e4e57a9a9f02762f431f165f083---f96564ae-b29c-406c-8d02-e208035e8c5e",
    "object": "chat.completion",
    "created": 1758538027,
    "model": "ibm/granite-3-8b-instruct",
    "choices": [
      {
        "index": 0,
        "message": {
          "role": "assistant",
          "content": "A tram, also known as streetcar or trolley, is a rail-based public transport vehicle that operates on tracks laid in public right-of-way, usually in urban settings. Trams can be motorized or horse-drawn, but modern trams are typically electrically powered and run on dedicated lanes or shared streets. They are capable of carrying a moderate number of passengers and are often integrated into a city\u2019s existing public transport system. Trams are distinguished from light rail vehicles (LRVs) by their higher frequency of service, shorter headways, and operation in mixed traffic, whereas light rail vehicles usually run in sepa

### Deploy AI service

Store AI service with previously created custom software specification

In [21]:
meta_props = {
    client.repository.AIServiceMetaNames.NAME: "Model Gateway AI service with SDK",
    client.repository.AIServiceMetaNames.SOFTWARE_SPEC_ID: software_specification_id,
}

stored_ai_service_details = client.repository.store_ai_service(
    deployable_ai_service, meta_props
)

In [22]:
ai_service_id = client.repository.get_ai_service_id(stored_ai_service_details)
ai_service_id

'd0ab4ac0-347a-41b6-b392-6cff948cea57'

Create online deployment of AI service.

In [23]:
meta_props = {
    client.deployments.ConfigurationMetaNames.NAME: "AI service with SDK",
    client.deployments.ConfigurationMetaNames.ONLINE: {},
}

deployment_details = client.deployments.create(ai_service_id, meta_props)



######################################################################################

Synchronous deployment creation for id: 'd0ab4ac0-347a-41b6-b392-6cff948cea57' started

######################################################################################


initializing
Note: online_url and serving_urls are deprecated and will be removed in a future release. Use inference instead.
......
ready


-----------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_id='c7741b7d-78f0-4fbb-b968-73afb56909a5'
-----------------------------------------------------------------------------------------------




Obtain the `deployment_id` of the previously created deployment.

In [24]:
deployment_id = client.deployments.get_id(deployment_details)

### Execute the AI service

In [25]:
question = "Summarize core values of IBM"

deployments_results = client.deployments.run_ai_service(
    deployment_id, {"prompt": question}
)

In [26]:
import json

print(json.dumps(deployments_results, indent=2))

{
  "cached": false,
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "IBM's core values are encapsulated in its \"Values at Work\" program, which consists of eight distinct principles:\n\n1. Dedication to delivering superior value to all stakeholders\n2. Innovation that matters for our company and our customers\n3. Integrity by earning trust through ethical and respectful behavior\n4. Inclusion and diversity - it's a core strength and a vital ingredient for innovation and growth\n5. Collaboration across the enterprise to best serve our customers\n6. Accountability for results in a results-driven culture\n7. Empowerment to act as one, including taking calculated risks and learning from mistakes\n8. Respect for the individual and the planet\n\nThese values guide the behavior and decision-making processes of IBM and its employees, promoting a strong organizational culture based on ethics, innovation, diver

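To pull just the generated text out of a response shaped like the output above, you can index into `choices[0]["message"]["content"]`. The helper below is a minimal sketch: `extract_text` is a hypothetical name, the sample dict is made up, and the helper assumes the response is either the chat-completion dict itself or that dict wrapped in a `"body"` key, as returned by the locally executed function.

```python
def extract_text(response: dict) -> str:
    """Return the assistant message text from a chat-completion style dict."""
    # The locally executed function wraps the completion in {"body": ...}.
    completion = response.get("body", response)
    return completion["choices"][0]["message"]["content"]


# Made-up sample that follows the layout of the printed response.
sample = {
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {"role": "assistant", "content": "Hello!"},
        }
    ],
    "model": "ibm/granite-3-8b-instruct",
}

print(extract_text(sample))
print(extract_text({"body": sample}))
```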
<a id="create-models-ai-service-load-balancing"></a>

## Create models and deploy them as an AI service with load balancing
In this section we will create models with the same alias using Model Gateway and deploy them as an AI service in order to perform load balancing between them.

**Note:** This sample notebook creates three providers using watsonx.ai. It's worth pointing out that Model Gateway can also load balance between other providers, such as AWS Bedrock or NVIDIA NIM, as well as between different datacenters. 

### Create models using Model Gateway with the same alias on different providers
In this sample we will use the `ibm/granite-3-8b-instruct`, `meta-llama/llama-3-2-11b-vision-instruct`, and `meta-llama/llama-3-3-70b-instruct` models in the same datacenter.

**Tip:** You can also load balance across datacenters in different regions. To do so, create each provider with a secret that holds credentials for a different datacenter.
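One way to organize a multi-region setup is to keep one secret per datacenter and derive the provider arguments from a mapping. The sketch below only builds the keyword arguments: the secret CRNs are placeholders, and `provider_kwargs_by_region` is a hypothetical helper. Each resulting dict is meant to be passed to `gateway.providers.create(**kwargs)`, the same call used earlier in this notebook.

```python
def provider_kwargs_by_region(secret_crns: dict) -> list:
    """Build one set of provider-creation arguments per datacenter secret."""
    return [
        {
            "provider": "watsonxai",
            "name": f"watsonx-ai-provider-{region}",
            "secret_crn_id": crn,
        }
        for region, crn in sorted(secret_crns.items())
    ]


# Placeholder values: each secret would hold credentials (and a `base_url`)
# for a different datacenter.
secret_crns = {
    "ca-tor": "PASTE_YOUR_TORONTO_SECRET_CRN_HERE",
    "us-south": "PASTE_YOUR_DALLAS_SECRET_CRN_HERE",
}

for kwargs in provider_kwargs_by_region(secret_crns):
    print(kwargs["name"], "->", kwargs["secret_crn_id"])
    # With a configured `gateway`: gateway.providers.create(**kwargs)
```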

In [27]:
model_alias = "load-balancing-models"

#### Create provider for `ibm/granite-3-8b-instruct` model

In [28]:
granite_3_model = "ibm/granite-3-8b-instruct"

watsonx_ai_provider_1_details = gateway.providers.create(
    provider="watsonxai", name="watsonx-ai-provider-1", secret_crn_id=secret_crn_id
)

watsonx_ai_provider_1_id = gateway.providers.get_id(watsonx_ai_provider_1_details)

granite_3_model_details = gateway.models.create(
    provider_id=watsonx_ai_provider_1_id, model=granite_3_model, alias=model_alias
)

granite_3_model_id = gateway.models.get_id(granite_3_model_details)

#### Create provider for `meta-llama/llama-3-2-11b-vision-instruct` model

In [29]:
llama_3_2_model = "meta-llama/llama-3-2-11b-vision-instruct"

watsonx_ai_provider_2_details = gateway.providers.create(
    provider="watsonxai", name="watsonx-ai-provider-2", secret_crn_id=secret_crn_id
)

watsonx_ai_provider_2_id = gateway.providers.get_id(watsonx_ai_provider_2_details)

llama_3_2_model_details = gateway.models.create(
    provider_id=watsonx_ai_provider_2_id, model=llama_3_2_model, alias=model_alias
)

llama_3_2_model_id = gateway.models.get_id(llama_3_2_model_details)

#### Create provider for `meta-llama/llama-3-3-70b-instruct` model

In [30]:
llama_3_3_model = "meta-llama/llama-3-3-70b-instruct"

watsonx_ai_provider_3_details = gateway.providers.create(
    provider="watsonxai", name="watsonx-ai-provider-3", secret_crn_id=secret_crn_id
)

watsonx_ai_provider_3_id = gateway.providers.get_id(watsonx_ai_provider_3_details)

llama_3_3_model_details = gateway.models.create(
    provider_id=watsonx_ai_provider_3_id, model=llama_3_3_model, alias=model_alias
)

llama_3_3_model_id = gateway.models.get_id(llama_3_3_model_details)

#### List available providers

In [31]:
gateway.providers.list()

Unnamed: 0,ID,NAME,TYPE
0,c18a1610-b2fa-47bb-a986-02bceff805d8,watsonx-ai-provider,watsonxai
1,ae732127-fd4e-4fa6-8ee3-14a385d09e63,watsonx-ai-provider-1,watsonxai
2,1d433374-5bff-46ea-8f9d-a920fa3e214c,watsonx-ai-provider-3,watsonxai
3,a6acebaf-0028-483a-9751-a13694b1eade,watsonx-ai-provider-2,watsonxai


#### List available models

In [32]:
gateway.models.list()

Unnamed: 0,ID,MODEL,CREATED,TYPE
0,bb8a37bc-ff47-40b8-9603-95db80aae2fd,load-balancing-models,2025-09-22 12:48:42,watsonxai:watsonx-ai-provider-3
1,ffeb2f2f-9ef3-4e3e-8b34-20c2e9e22dde,load-balancing-models,2025-09-22 12:48:33,watsonxai:watsonx-ai-provider-2
2,10911f1f-01ce-402c-ba1e-0fb36a0dae22,load-balancing-models,2025-09-22 12:48:30,watsonxai:watsonx-ai-provider-1
3,65ce9f6c-0aa6-4052-b84e-87b59a59586b,ibm/granite-3-8b-instruct,2025-09-22 12:46:13,watsonxai:watsonx-ai-provider


### Create AI service

Prepare the function that will be deployed as an AI service. Specify default values for the parameters that will be passed to the function.

In [33]:
def deployable_load_balancing_ai_service(context, url=credentials.url, model_alias=model_alias, **kwargs): # fmt: skip
    from ibm_watsonx_ai import APIClient, Credentials
    from ibm_watsonx_ai.gateway import Gateway

    api_client = APIClient(
        credentials=Credentials(url=url, token=context.generate_token()),
        space_id=context.get_space_id(),
    )

    gateway = Gateway(api_client=api_client)

    def generate(context) -> dict:
        api_client.set_token(context.get_token())

        payload = context.get_json()
        prompt = payload["prompt"]

        messages = [
            {
                "role": "user",
                "content": prompt,
            }
        ]

        response = gateway.chat.completions.create(model=model_alias, messages=messages)

        return {"body": response}

    return generate

### Testing AI service's function locally

Create AI service function

In [34]:
from ibm_watsonx_ai.deployments import RuntimeContext

context = RuntimeContext(api_client=client)
local_load_balancing_function = deployable_load_balancing_ai_service(context=context)

Prepare request payload

In [35]:
context.request_payload_json = {"prompt": "Explain what IBM is"}

Execute the function locally

In [36]:
import asyncio
from collections import Counter
from typing import Coroutine


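async def send_requests(function, context):
    tasks: list[asyncio.Task] = []
    for _ in range(25):
        # Schedule the call immediately; a bare coroutine would not start
        # until `gather`, which would make the delay below ineffective.
        task = asyncio.create_task(asyncio.to_thread(function, context))
        tasks.append(task)
        await asyncio.sleep(0.2)

    return await asyncio.gather(*tasks)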


loop = asyncio.get_event_loop()
responses = await loop.create_task(
    send_requests(function=local_load_balancing_function, context=context)
)

Counter(map(lambda x: x["body"]["model"], responses))

Counter({'ibm/granite-3-8b-instruct': 13,
         'meta-llama/llama-3-2-11b-vision-instruct': 7,
         'meta-llama/llama-3-3-70b-instruct': 5})

As demonstrated, out of 25 requests sent to Model Gateway:
- 13 of them were handled by `ibm/granite-3-8b-instruct`,
- 7 of them were handled by `meta-llama/llama-3-2-11b-vision-instruct`,
- 5 of them were handled by `meta-llama/llama-3-3-70b-instruct`.
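The staggered-request pattern can be tried without any network access. In the sketch below, `fake_request` is a hypothetical blocking call that records when it starts, and `send_staggered` is an illustrative name. Each call is wrapped in `asyncio.create_task` so that it begins running as soon as it is scheduled, which is what makes the pause between iterations space the requests out.

```python
import asyncio
import time


def fake_request(started: list) -> str:
    """Hypothetical blocking call; records when it actually starts."""
    started.append(time.monotonic())
    time.sleep(0.01)  # simulate a short blocking request
    return "done"


async def send_staggered(func, arg, count=5, delay=0.05):
    tasks = []
    for _ in range(count):
        # create_task schedules the call now; a bare coroutine would not
        # start until it is awaited (for example inside asyncio.gather).
        tasks.append(asyncio.create_task(asyncio.to_thread(func, arg)))
        await asyncio.sleep(delay)
    return await asyncio.gather(*tasks)


started: list = []
results = asyncio.run(send_staggered(fake_request, started))
spread = max(started) - min(started)
print(len(results), f"requests started over {spread:.2f}s")
```

Inside a notebook, where an event loop is already running, use `await send_staggered(fake_request, started)` instead of `asyncio.run(...)`.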

### Deploy AI service

Store AI service with previously created custom software specification

In [37]:
meta_props = {
    client.repository.AIServiceMetaNames.NAME: "Model Gateway load balancing AI service with SDK",
    client.repository.AIServiceMetaNames.SOFTWARE_SPEC_ID: software_specification_id,
}

stored_ai_service_details = client.repository.store_ai_service(
    deployable_load_balancing_ai_service, meta_props
)

In [38]:
ai_service_id = client.repository.get_ai_service_id(stored_ai_service_details)
ai_service_id

'baaa36d4-88bf-4983-8ae3-3cbd27be4f3a'

Create online deployment of AI service.

In [39]:
meta_props = {
    client.deployments.ConfigurationMetaNames.NAME: "Load balancing AI service with SDK",
    client.deployments.ConfigurationMetaNames.ONLINE: {},
}

deployment_details = client.deployments.create(ai_service_id, meta_props)



######################################################################################

Synchronous deployment creation for id: 'baaa36d4-88bf-4983-8ae3-3cbd27be4f3a' started

######################################################################################


initializing
Note: online_url and serving_urls are deprecated and will be removed in a future release. Use inference instead.
.....
ready


-----------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_id='a9571c6e-befc-4dff-9e5c-858bd47294a7'
-----------------------------------------------------------------------------------------------




Obtain the `deployment_id` of the previously created deployment.

In [40]:
deployment_id = client.deployments.get_id(deployment_details)

### Execute the AI service
The following cell sends 25 requests to the AI service asynchronously. A 0.2-second delay between requests helps avoid `429 Too Many Requests` errors.

In [41]:
async def send_requests(question):
    tasks: list[asyncio.Task] = []
    for _ in range(25):
        # Wrap the coroutine in a task so the request starts now,
        # not when `gather` is called.
        tasks.append(
            asyncio.create_task(
                client.deployments.arun_ai_service(deployment_id, {"prompt": question})
            )
        )

        await asyncio.sleep(0.2)

    return await asyncio.gather(*tasks)


loop = asyncio.get_event_loop()
responses = await loop.create_task(
    send_requests(question="Explain to me what is a dog in cat language")
)

Counter(map(lambda x: x["model"], responses))

Counter({'ibm/granite-3-8b-instruct': 12,
         'meta-llama/llama-3-2-11b-vision-instruct': 10,
         'meta-llama/llama-3-3-70b-instruct': 3})

As demonstrated, out of 25 requests sent to the AI service:
- 12 of them were handled by `ibm/granite-3-8b-instruct`,
- 10 of them were handled by `meta-llama/llama-3-2-11b-vision-instruct`,
- 3 of them were handled by `meta-llama/llama-3-3-70b-instruct`.
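To compare runs more easily, the counts can be turned into percentage shares. The helper below is a small sketch with a hypothetical name, `model_shares`; the counts are copied from the output above. With only 25 requests the split is noisy, so treat the shares as a rough indication rather than a statement about how the gateway chooses a model.

```python
from collections import Counter


def model_shares(counts: Counter) -> dict:
    """Convert per-model request counts into percentage shares."""
    total = sum(counts.values())
    return {model: round(100 * n / total, 1) for model, n in counts.items()}


# Counts copied from the output above.
counts = Counter(
    {
        "ibm/granite-3-8b-instruct": 12,
        "meta-llama/llama-3-2-11b-vision-instruct": 10,
        "meta-llama/llama-3-3-70b-instruct": 3,
    }
)

for model, share in model_shares(counts).items():
    print(f"{model}: {share}%")
```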

<a id="summary"></a>
## Summary and next steps

You successfully completed this notebook!

You learned how to create and deploy a load-balancing AI service with Model Gateway using the `ibm_watsonx_ai` SDK.

Check out our _<a href="https://ibm.github.io/watsonx-ai-python-sdk/samples.html" target="_blank" rel="noopener no referrer">Online Documentation</a>_ for more samples, tutorials, documentation, how-tos, and blog posts. 

### Author

**Rafał Chrzanowski**, Software Engineer Intern at watsonx.ai.

Copyright © 2025 IBM. This notebook and its source code are released under the terms of the MIT License.