# Model Inference Service Administration

TODO – WRITE INSTRUCTIONS FOR CHUNK AND UPLOAD [HERE](#Vllmfiles)

TODO – ADDRESS ALI'S COMMENT (Add a section for waiting until engine is ready before overriding the routes. PLAT-69764: Enforce that warmup is completed for an MlAtomicPipe.Engine prior to allowing a route to be overridden)

TODO – ADDRESS ALI'S COMMENT (Add a section to clarify redeploying an entry with a new configuration might not work because of PLAT-74324: Re-deploying an entry with different configurations doesn't work. the workaround is to create a new model registry entry)
 

## Table of contents

- [Introduction](#Introduction)
    - [Definitions](#Definitions)
        - [Model Inference Service application](#MISapp)
        - [Pipe Registration application](#Pipeapp)
        - [Architecture 1 vs. architecture 2](#Architectures)
    - [Package dependencies](#Packagedependencies)
    - [Prerequisites](#Prereqs)
- [Service creation](#Servicecreation)
    - [Starting an environment](#Startingenv)
    - [Starting the services](#Startingservices)
        - [Starting the Model Inference Service](#Startingmodelinferenceservice)
        - [Starting a pipe registration application](#Startingpipereg)
        - [Starting a Model Registry](#Startingregistry)
        - [Starting a test client](#Startingclient)
        - [Connecting your client application(s) to the Model Inference Service](#Connecting)
- [Model serving](#Serving)
    - [Creating a VllmPipe](#Vllmpipe)
        - [Creating a VllmPipe using a model from Hugging Face Hub](#Vllmhf)
        - [Creating a VllmPipe using model files](#Vllmfiles)
    - [Registering a pipe to the Model Registry](#Registering)
    - [Pipe deployment](#Deployment)
        - [Nodepool creation](#Nodepool)
        - [Retrieving and deploying the pipe](#Deploy)
- [Route management](#Routes)
    - [Changing a route](#Changingroute)
    - [Terminating a deployment](#Terminating)
- [Monitoring the service](#Monitoring)
- [Scaling the service](#Scaling)
    - [Scaling up](#Scalingup)
    - [Scaling down](#Scalingdown)

## Introduction <a name="Introduction"></a>

This tutorial covers the administration of a Model Inference service. A C3 AI cluster that hosts an application requiring a large language model (LLM) will likely need a Model Inference service to route and serve text generation requests. In this tutorial, we demonstrate the creation and management of a Model Inference service and related services.

App developers who need to call an LLM from their application should refer to the ModelInference-Client tutorial.

### Definitions <a name="Definitions"></a>

#### Model Inference Service application <a name="MISapp"></a>

The Model Inference service application (MIS app) is the application used to deploy pipes for LLM completion and route all completion requests to the proper deployments.

#### Pipe registration application <a name="Pipeapp"></a>

The pipe registration application is the application used for registering pipes to the Model Registry.

#### Architecture 1 vs. architecture 2 <a name="Architectures"></a>

Depending on your team's preferences, you may want to opt for one of these two architectures:

- Architecture 1: The Model Inference service application and the pipe registration application are *separate applications*.
- Architecture 2: The Model Inference service application and the pipe registration application are the *same application*.

In the diagrams in this tutorial, we assume that a GenAI application will be requesting LLM text generation.

Architecture 1 | Architecture 2
- | - 
![Architecture 1](architecture1.jpg) | ![Architecture 2](architecture2.jpg)

### Package dependencies <a name="Packagedependencies"></a>

The packages relevant to this tutorial are:
- `modelInferenceService` : The package used for creating the Model Inference service application.
- `modelInferencePipes` :  The package used for creating the pipe registration application.
- `modelInference` : The package used for the client of the Model Inference service, allowing it to use the completion API.
- `modelRegistryService` : The package used for creating a Model Registry.

In architecture 1, the Model Inference service application depends on the `modelInferenceService` package, which depends on the `modelInferencePipes` package. The pipe registration application only depends on the `modelInferencePipes` package. 

In architecture 2, the Model Inference service application depends on the `modelInferenceService` package, and there is no separate pipe registration application; pipes are registered from the Model Inference service application itself.

In both architectures, the GenAI application depends on the `genaibase` package and should depend on the `modelInference` package to support using the Model Inference Service for LLM text generation.

Architecture 1 | Architecture 2
- | - 
![Architecture 1](architecture1packages.jpg) | ![Architecture 2](architecture2packages.jpg)

### Prerequisites <a name="Prereqs"></a>

The Kubernetes nodepools must be configured with the required resources (GPUs, CPUs, memory, disk, etc.). This is typically done by Operations.

## Service creation <a name="Servicecreation"></a>

### Starting an environment <a name="Startingenv"></a>

From C3 AI Studio, start a multi-node environment called `inference` or a name of choice with server version `8.3.1` or greater. This environment will contain the applications needed to serve LLM completions to another application in the cluster.

### Starting the services <a name="Startingservices"></a>

Begin by accessing the static console for the multi-node environment you've created. Generally, this is accessed through a URL of the form `https://<cluster_name>.cloud/<environment_name>/c3/static/console/index.html`.
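
As a small convenience, the URL form above can be assembled programmatically. The sketch below is in Python (the language of the notebook cells later in this tutorial); the function name and the cluster name are hypothetical, and the URL pattern is only the general form quoted above, so your cluster may differ.

```python
def console_url(cluster_name: str, environment_name: str) -> str:
    """Build the static console URL for an environment, using the general URL form above."""
    return f"https://{cluster_name}.cloud/{environment_name}/c3/static/console/index.html"


# "mycluster" is a placeholder; "inference" is the environment name suggested in this tutorial.
print(console_url("mycluster", "inference"))
```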

#### Starting the Model Inference Service <a name="Startingmodelinferenceservice"></a>

Open Chrome DevTools, and start the application for the Model Inference service.

```
var rootPkg = "modelInferenceService"
var appName = "service"
C3.env().startApp({rootPkg:rootPkg, name:appName})
```

#### Starting a pipe registration application <a name="Startingpipereg"></a>

Next, if you decided that architecture 1 (see [Architecture 1 vs. architecture 2](#Architectures)) is best for your use case, you also need to start the pipe registration app.

```
var rootPkg = "modelInferencePipes"
var appName = "pipesapp"
C3.env().startApp({rootPkg:rootPkg, name:appName})
```

#### Starting a Model Registry <a name="Startingregistry"></a>

The Model Registry service should already be available in the cluster when AI Studio is deployed. Optionally, you can create a separate Model Registry for testing.

```
var rootPkg = "modelRegistryService"
var appName = "registryservice"
C3.env().startApp({rootPkg:rootPkg, name:appName})
```

#### Starting a test client <a name="Startingclient"></a>

If you'd like to set up an application to act as the client for the Model Inference service for testing, you need to start a client application.

```
var rootPkg = "modelInference"
var appName = "client"
C3.env().startApp({rootPkg:rootPkg, name:appName})
```

#### Connecting your client application(s) to the Model Inference Service <a name="Connecting"></a>

Finally, you must configure your client application so that it can connect to the Model Inference service application to request completions. If you are connecting the test client application you created above, run the following:

```
var name = "ModelInference"
var serviceApp = App.forName('service')
Microservice.Config.forName(name).setConfigValue("appId", serviceApp.id, ConfigOverride.ENV)
```

Because we use `ConfigOverride.ENV`, this sets the `ModelInference` microservice configuration not only for the client application but for every application in the environment. To make every application in the cluster use this Model Inference service for LLM text generation, you can run the same code with `ConfigOverride.CLUSTER`, although this is not recommended.

To connect an actual application in a different environment in the cluster (perhaps a GenAI application) to the Model Inference service, access the static console of that application and run the following:

```
var name = "ModelInference"
var serviceAppId = '<CLUSTERNAME>-inference-service'
Microservice.Config.forName(name).setConfigValue("appId", serviceAppId, ConfigOverride.APP) // Use ConfigOverride.APP so that no other app is affected.
```
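
The `serviceAppId` placeholder above suggests that an application ID is composed of the cluster name, the environment name (`inference`), and the application name (`service`). The Python sketch below builds such an ID under that assumption. The `<cluster>-<environment>-<app>` pattern is inferred from the placeholder only, and the function name is hypothetical, so confirm the real value by checking `App.forName('service').id` in the service environment.

```python
def service_app_id(cluster_name: str,
                   environment_name: str = "inference",
                   app_name: str = "service") -> str:
    """Compose an app ID, ASSUMING the pattern <cluster>-<environment>-<app>."""
    return f"{cluster_name}-{environment_name}-{app_name}"


# "mycluster" is a placeholder for your cluster name.
print(service_app_id("mycluster"))
```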

Similarly, if you set up a separate Model Registry for testing, you must configure the client to use that Model Registry.

```
var name = "ModelRegistry"
var serviceApp = App.forName('registryservice')
Microservice.Config.forName(name).setConfigValue("appId", serviceApp.id, ConfigOverride.ENV)
```

## Model serving <a name="Serving"></a>

To serve an LLM for text generation on the C3 AI Platform, you need to:
- Create and register a `VllmPipe` to the Model Registry.
- Create an appropriately sized nodepool for the deployment of the model.
- Deploy the `VllmPipe` to that nodepool.
- Set a route that can be used to access the deployment.
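
The steps above, each detailed in the sections that follow, can be sketched end to end as a single helper. This is only an outline under stated assumptions: it uses just the calls that appear in this tutorial, takes the `c3` handle as a parameter, assumes architecture 2 (registration and deployment happen in the same application), and assumes the nodepool already exists. The function name and the parameter name `entry_label` are hypothetical; the tutorial does not name the third argument of `registerMlPipe`.

```python
def serve_model(c3, model_id, tensor_parallel_size, uri, entry_label, node_pool, route):
    """Register, deploy, and route a VllmPipe using only calls shown in this tutorial."""
    # 1. Create the pipe and register it to the Model Registry.
    pipe = c3.VllmPipe(modelId=model_id, tensorParallelSize=tensor_parallel_size).withDefaults()
    c3.ModelRegistry.registerMlPipe(pipe, uri, entry_label)

    # 2. The nodepool named `node_pool` is assumed to have been created beforehand.

    # 3. Retrieve the latest registry entry for the URI and deploy it to the nodepool.
    entry = c3.ModelRegistry.listVersions(uri).objs[0]
    c3.ModelInference.deploy(entry, nodePool=node_pool)

    # 4. Set the route that clients will use to reach the deployment.
    c3.ModelInference.setRoute(entry, route)
    return entry
```

With the values used later in this tutorial, the call would look like `serve_model(c3, "tiiuae/falcon-40b", 8, "f40b", "falcon40b-8gpu", "8xa100", "falcon40b")`.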

### Creating a `VllmPipe` <a name="Vllmpipe"></a>

A `VllmPipe` can be created either by downloading the files from Hugging Face Hub or by using model files you have already downloaded. Please note that a `VllmPipe` can only be created with specific, supported models. Currently, Falcon-40B is the only model officially tested and supported on the C3 AI Platform, but theoretically supported models can be found [here](https://vllm.readthedocs.io/en/latest/models/supported_models.html).

#### Creating a `VllmPipe` using a model from Hugging Face Hub <a name="Vllmhf"></a>

In Jupyter, we create a `VllmPipe` by specifying:
- `modelId` : The unique identifier of the model on Hugging Face Hub.
- `tensorParallelSize` : The number of GPUs across which the model is split (sharded).

In [None]:
modelId = "tiiuae/falcon-40b"
pipe = c3.VllmPipe(modelId=modelId, tensorParallelSize=8).withDefaults()
pipe
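
To get a feel for how `tensorParallelSize` relates to GPU memory, the following back-of-the-envelope sketch estimates the weight memory per GPU. It is a rough estimate under stated assumptions, not a sizing rule: it uses a nominal 40 billion parameters, assumes 16-bit (2-byte) weights, assumes the weights are split evenly, and ignores the KV cache, activations, and runtime overhead, all of which need additional memory. The function name is hypothetical.

```python
def weights_gb_per_gpu(num_params: float, bytes_per_param: int, tensor_parallel_size: int) -> float:
    """Approximate weight memory per GPU in GB (10^9 bytes), assuming an even split."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / tensor_parallel_size / 1e9


# Nominal 40e9 parameters, 2 bytes per parameter, sharded over 8 GPUs.
per_gpu = weights_gb_per_gpu(40e9, 2, 8)
print(f"{per_gpu:.1f} GB of weights per GPU")
```

Under these assumptions, the weights alone take a fraction of each 40 GB A100 used later in this tutorial, which leaves room for the KV cache and other runtime memory.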

#### Creating a `VllmPipe` using model files <a name="Vllmfiles"></a>

If you have model files locally, you can create the `VllmPipe` by specifying the `local_path` for the directory containing the model files and running the following:

In [None]:
pipe = c3.VllmPipe.convert(local_path)

Or, if the model files are located in a filesystem mounted to the cluster, you can create the `VllmPipe` by specifying the `filesystem_path` and running the following:

In [None]:
pipe = c3.VllmPipe(modelUrl=filesystem_path)

TODO – WRITE INSTRUCTIONS FOR CHUNK AND UPLOAD HERE

### Registering a pipe to the Model Registry <a name="Registering"></a>

To serve LLM completions on the C3 AI Platform, a `VllmPipe` associated with an LLM must be registered to the Model Registry. This pipe registration must be done from one of the following applications depending on the choice of architecture:
- The pipe registration application (architecture 1) or
- The Model Inference service application (architecture 2)

In [None]:
c3.ModelRegistry.registerMlPipe(pipe, "f40b", "falcon40b-8gpu")

### Pipe deployment <a name="Deployment"></a>

#### Nodepool creation <a name="Nodepool"></a>

In order to deploy the pipe, we need to create a nodepool to which we will deploy it.

In [None]:
hwProfile = c3.HardwareProfile.upsertProfile({
    "name": '8x40a100_90cpu_600mem',
    "cpu": 90,
    "memoryMb": 600000,
    "gpu": 8,  # 8 GPUs, matching the profile name and tensorParallelSize=8 above
    "gpuKind": 'nvidia-a100-40gb-8',
    "gpuVendor": 'nvidia',
    "diskGb" : 1000
});
    
name = '8xa100'
    
c3.app().configureNodePool(name,                    # name of the node pool to configure 
                          1,                        # sets the target node count
                          1,                        # sets the minimum node count
                          1,                        # sets the maximum node count
                          hwProfile,                # sets the hardware profile
                          [c3.Server.Role.SERVICE], # sets the server role that this node pool will function as                  
                          False,                    # optional - specifies whether autoscaling should be enabled
                          0.10                      # optional - JvmSpec
                          ).update();   
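
Because the pipe above was created with `tensorParallelSize=8`, the node needs at least that many GPUs. The following plain-Python sanity check illustrates the idea. It works on an ordinary dictionary that mirrors the hardware profile fields shown above rather than on a C3 object, and the function name is hypothetical.

```python
def check_profile_fits(profile: dict, tensor_parallel_size: int) -> None:
    """Raise ValueError if the profile has fewer GPUs than the pipe's tensorParallelSize."""
    gpus = profile.get("gpu", 0)
    if gpus < tensor_parallel_size:
        raise ValueError(
            f"Profile {profile.get('name')!r} has {gpus} GPU(s), "
            f"but tensorParallelSize is {tensor_parallel_size}."
        )


# A profile with 8 GPUs passes silently for tensorParallelSize=8.
check_profile_fits({"name": "8x40a100_90cpu_600mem", "gpu": 8}, tensor_parallel_size=8)

# A profile with only 4 GPUs is rejected.
try:
    check_profile_fits({"name": "too_small", "gpu": 4}, tensor_parallel_size=8)
except ValueError as err:
    print(err)
```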

#### Retrieving and deploying the pipe <a name="Deploy"></a>

We retrieve the latest entry for the `"f40b"` URI from the Model Registry. In order to deploy the pipe successfully, this must be performed in the Model Inference service application.

In [None]:
vers = c3.ModelRegistry.listVersions("f40b").objs
entry = vers[0]
entry

We use the `deploy()` API to deploy the entry to the nodepool we created.

In [None]:
c3.ModelInference.deploy(entry, nodePool="8xa100")

Finally, we set a route that can be used to access this deployment.

In [None]:
c3.ModelInference.setRoute(entry, "falcon40b")

Now the client application can use the `c3.ModelInference.completion()` API with this route to request text generation from this Falcon-40B LLM deployment. To test this, try running the following line from the client application in Jupyter:

In [None]:
c3.ModelInference.completion(route="falcon40b", prompts=["hello"], params = {'max_tokens' : 128})

Note that the `c3.ModelInference.completion()` API can also be called from the Model Inference service application for testing purposes.

## Route management <a name="Routes"></a>

It may become necessary to change/upgrade an LLM you are using. This can be done by deploying a new `VllmPipe` (if not already deployed) and then changing the route to use the new deployment instead of the old one. In this section, we demonstrate an example.

### Changing a route <a name="Changingroute"></a>

Let's assume that we are replacing the Falcon-40B model we deployed above with a new Falcon-40B model. Let's also assume that we've registered a `VllmPipe` for that model with the same URI `"f40b"` and created a new nodepool named `"8xa100_new"` for the deployment. We can run the following code to retrieve the latest entry in the Model Registry for that URI:

In [None]:
vers = c3.ModelRegistry.listVersions("f40b").objs
new_entry = vers[0]

We deploy the new entry to the nodepool we created.

In [None]:
c3.ModelInference.deploy(new_entry, nodePool="8xa100_new")
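
As the warmup note at the top of this document indicates, the new engine should be ready before a route is switched over to it. A generic polling helper such as the one below can be used to wait. This is a minimal sketch: `is_ready` is a hypothetical, caller-supplied check (for example, one based on what `c3.ModelInference.summary()` reports), since this tutorial does not document a dedicated readiness API, and the default timeout and interval are arbitrary.

```python
import time
from typing import Callable


def wait_until_ready(is_ready: Callable[[], bool],
                     timeout_s: float = 1800.0,
                     interval_s: float = 30.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll `is_ready` until it returns True or `timeout_s` elapses.

    Returns True if the check succeeded, False if the timeout was reached.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the helper can be exercised quickly in tests; in normal use, the defaults are sufficient.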

Now that we've retrieved and deployed the latest entry, we can simply change the route we were using before (`"falcon40b"`) to use this new entry instead of the old one, thereby upgrading the route to use our newer Falcon-40B large language model.

In [None]:
c3.ModelInference.setRoute(pipeEntry=new_entry, route="falcon40b", overwrite=True)

### Terminating a deployment <a name="Terminating"></a>

If we no longer need the old Falcon-40B deployment, we can terminate it.

In [None]:
# Terminate the engine
c3.ModelInference.terminate(entry, confirm=True, removeRoutes=True)

# Terminate the nodepool
c3.app().nodePool('8xa100').terminate()
c3.App.NodePool.Config.forName('8xa100').clearConfigAndSecretAllOverrides()

## Monitoring the service <a name="Monitoring"></a>

We can run the `c3.ModelInference.summary()` API to monitor the underlying engines, threadpools, and threads on our Model Inference service.

In [None]:
c3.ModelInference.summary()

## Scaling the service <a name="Scaling"></a>

It may become necessary to scale the service when request volume changes. Currently, this is done manually by changing the number of nodes in the nodepool corresponding to the pipe deployment. In the world of large language models, adding a new node to the nodepool corresponding to the pipe deployment is sometimes referred to as "creating a replica of the model," as the new node in the nodepool will have its own copy of the model in its GPU memory. 
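
When deciding how many nodes (replicas) to run, a simple ceiling division is a reasonable starting point. The sketch below assumes that throughput scales roughly linearly with the number of replicas, which is an approximation. The function name and the request rates in the example are hypothetical and should be replaced with figures measured on your own deployment.

```python
import math


def nodes_needed(target_requests_per_min: float, requests_per_min_per_node: float) -> int:
    """Estimate the node count, assuming throughput scales linearly with replicas."""
    if requests_per_min_per_node <= 0:
        raise ValueError("requests_per_min_per_node must be positive")
    return max(1, math.ceil(target_requests_per_min / requests_per_min_per_node))


# Hypothetical figures: 45 requests/min expected, 30 requests/min sustained by one node.
print(nodes_needed(45, 30))
```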

### Scaling up <a name="Scalingup"></a>

In order to "create a replica," or scale up a deployment, we can run the following to increase the number of nodes in the nodepool from one to two.

In [None]:
nodepool_name = "8xa100_new"
num_nodes = 2

c3.app().nodePool(nodepool_name).setNodeCount(num_nodes, num_nodes, num_nodes).update()

We can get updates on the scaling by running the cell below.

In [None]:
import time

np1 = c3.app().nodePool(nodepool_name)
assert not np1.isReady()  # right after the update, the nodepool should still be scaling

num = 0
while not np1.isReady():
    # The existing node should keep serving requests while the new node starts.
    res1 = c3.ModelInference.completion(route="falcon40b", prompts=["hello"], params={})
    assert res1 is not None
    print(f'Nodepool {nodepool_name} is not ready.')
    print(f'Result of calling `completion`: {res1}')
    num += 1
    time.sleep(60)

assert num > 0

print(f'Nodepool {nodepool_name} scaled up and is ready!')

This creates a new node in the nodepool we named `"8xa100_new"` that will serve the same model that we have deployed to that nodepool. 

### Scaling down <a name="Scalingdown"></a>

Scaling down is similar to scaling up. In order to scale down our deployment to one node, we can run the following:

In [None]:
nodepool_name = "8xa100_new"
num_nodes = 1

c3.app().nodePool(nodepool_name).setNodeCount(num_nodes, num_nodes, num_nodes).update()
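
Since scaling up and scaling down use the same call, the two can be wrapped in one small helper. The sketch below takes the `c3` handle as a parameter and uses only the `setNodeCount(...).update()` chain shown above. The function name is hypothetical, and the reading of the three arguments as target, minimum, and maximum node counts is an assumption made by analogy with the `configureNodePool` call earlier in this tutorial.

```python
def set_fixed_node_count(c3, nodepool_name: str, num_nodes: int):
    """Pin a nodepool to exactly `num_nodes` nodes (same value for all three counts)."""
    if num_nodes < 0:
        raise ValueError("num_nodes must not be negative")
    # Assumed argument order: target, minimum, maximum node count.
    return c3.app().nodePool(nodepool_name).setNodeCount(num_nodes, num_nodes, num_nodes).update()
```

For example, `set_fixed_node_count(c3, "8xa100_new", 2)` would scale up and `set_fixed_node_count(c3, "8xa100_new", 1)` would scale back down.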