# Using the Model Inference Service for LLM Text Generation

TODO – ADD MATERIAL ON WHAT THE COMPLETION API ACTUALLY RETURNS

## Setup

### Setup functions

In [None]:
def create_pipe(model, fromFile=False):
    """
    model: The path or id of the model to use with vLLM
    fromFile: if True, call convert using the path specified by `model`.
    """
    if fromFile:
        pipe = c3.VllmPipe.convert(model, c3.VllmConvertSpec(trustRemoteCode=True, tensorParallelSize=8))
    else:
        pipe = c3.VllmPipe(modelId=model, trustRemoteCode=True, tensorParallelSize=8)
    return pipe.create(returnInclude='this')

def register_pipe_to_model_registry(pipe, uri, description):
    """
    pipe: pipe to register to model registry
    uri: uri to register the pipe with
    description: description of the pipe
    """
    return c3.ModelRegistry.registerMlPipe(pipe, uri, description)


def get_latest_entry(uri):
    """
    uri: uri of model registry entry 
    """
    vers = c3.ModelRegistry.listVersions(uri).objs
    return vers[0] if len(vers) > 0 else None


def create_nodepool(name="1t4"):
    """
    name: name of the node pool to configure
    """
    # Hardware profile: 1 NVIDIA T4 GPU, 30 CPUs, 115000 MB of memory
    hwProfile = c3.HardwareProfile.upsertProfile({
        "name": "1xt4_30cpu_115mem",
        "cpu": 30,
        "memoryMb": 115000,
        "gpu": 1,
        "gpuVendor": "nvidia",
        "gpuKind": "nvidia-t4"
    })

    c3.app().configureNodePool(
        name,                      # name of the node pool to configure
        1,                         # sets the target node count
        1,                         # sets the minimum node count
        1,                         # sets the maximum node count
        hwProfile,                 # sets the hardware profile
        [c3.Server.Role.SERVICE],  # sets the server role that this node pool will function as
        False,                     # optional - specifies whether autoscaling should be enabled
        0.10                       # optional - JvmSpec
    ).update()

    
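def deployModelForRoute(model, uri, description, nodepool_name, route, fromFile=False):
    """
    model: The path or id of the model to use with vLLM
    uri: The uri to register the pipe with in Model Registry
    description: description of the pipe
    nodepool_name: name of the node pool to deploy to (created if it does not exist)
    route: the route to set for the deployed model
    fromFile: if True, treat `model` as a path and convert it (see create_pipe)
    """
    entry = get_latest_entry(uri)
    if not entry:
        # Nothing is registered under this uri yet: create the pipe and register it.
        pipe = create_pipe(model, fromFile)
        entry = register_pipe_to_model_registry(pipe, uri, description)

    nodepool = c3.app().nodePool(nodepool_name)
    if nodepool is None:
        create_nodepool(nodepool_name)
        print(f"Created the nodepool: {nodepool_name}. Please edit the deployment to configure shared memory on the node")

    c3.ModelInference.deploy(entry, nodePool=nodepool_name)
    c3.ModelInference.setRoute(entry, route)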

### Setting up for completion

#### Start Required Microservices (Model Registry + Model Inference)


In [None]:
def _start_app_service(env, rootPkg, name):
    app = env.startApp(rootPkg=rootPkg, name=rootPkg.lower())
    c3.Microservice.Config.forName(name).setConfigValue("appId", app.id, c3.ConfigOverride.CLUSTER)
    return app

service_env = c3.env()    
#inference_app = _start_app_service(service_env, "modelInferenceService", "ModelInference")
registry_app = _start_app_service(service_env, "modelRegistryService", "ModelRegistry")

#### Creating and registering the pipe

In [None]:
modelId = "facebook/opt-125m"
pipe = create_pipe(modelId)

In [None]:
uri = "opt-125m"
description = "opt-125m facebook small model vanilla"

register_pipe_to_model_registry(pipe, uri, description)

#### Deploying the pipe and setting the route

In [None]:
create_nodepool()

In [None]:
deployModelForRoute(modelId, uri, description, '1t4', 'opt-125m')

In [None]:
list(c3.ModelInference.listRoutes())

## Introduction to the completion API

The `c3.ModelInference.completion()` API is the primary way to perform LLM text generation on the C3 Platform. 

In this tutorial, we detail the inputs to the API and demonstrate how to use it. By the end of the tutorial, you will be able to run your own completion request using the completion API as seen here:

In [None]:
# Define route:
route = 'opt-125m'

# Define prompts:
prompt_1 = "Respond to this question as if you were a computer scientist: What is the difference between interpreted and compiled programming languages?"
prompt_2 = "Respond to this question as if you were a data scientist: What is the difference between classification and regression?"
prompts = [prompt_1, prompt_2]

# Define params:
params = {
    'max_tokens' : 128,
    'temperature' : 0.5,
    'n' : 2
}

# Request LLM responses:
c3.ModelInference.completion(route=route, prompts=prompts, params=params)

## Inputs to the completion API

The inputs to the `c3.ModelInference.completion()` API are `route`, `prompts`, and `params`.

### The `route` input

The `route` input to the `c3.ModelInference.completion()` API is a string. 

It determines which LLM generates the responses for your prompts. The string must name a route that the Model Inference Service application administrator has set up for you to use. For example, the route `opt-125m` used in this tutorial refers to the opt-125m model from Facebook.

### The `prompts` input

The `prompts` input to the `c3.ModelInference.completion()` API is a list of strings.

It determines the prompts for which the LLM will generate responses.

### The `params` input

The `params` input to the `c3.ModelInference.completion()` API is a map from string to any.

It determines the model-specific parameters that will be used to generate the responses for your prompts.
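As a minimal sketch, the `params` map can be assembled through a small helper that checks values before a request is sent. The helper name `build_params` and its validation rules (positive integer `max_tokens` and `n`, non-negative `temperature`) are assumptions made for illustration; they are not part of the C3 API, and the model behind a given route may accept other keys or ranges.

```python
def build_params(max_tokens=128, temperature=0.5, n=1, **extra):
    """Build a params dict for a completion request.

    Hypothetical helper: the checks below are illustrative assumptions,
    not rules enforced by the Model Inference Service.
    """
    if not isinstance(max_tokens, int) or max_tokens < 1:
        raise ValueError("max_tokens must be a positive integer")
    if not isinstance(n, int) or n < 1:
        raise ValueError("n must be a positive integer")
    if temperature < 0:
        raise ValueError("temperature must be non-negative")

    params = {
        'max_tokens': max_tokens,
        'temperature': temperature,
        'n': n,
    }
    # Pass through any additional model-specific parameters unchanged.
    params.update(extra)
    return params


params = build_params(max_tokens=128, temperature=0.5, n=2)
print(params)
```

The resulting dict has the same shape as the `params` value passed to `c3.ModelInference.completion()` elsewhere in this tutorial.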

## The completion API in practice

Let's go through an example of how you might use the completion API.

### Checking the available routes

The route determines which LLM will be used to generate the responses for your prompts. You can check which routes have been made available for you by the Model Inference Service application administrator.

In [None]:
list(c3.ModelInference.listRoutes())

We can see that the route `opt-125m` is available to generate responses, so we set `route` to it.

In [None]:
route = 'opt-125m'

### Defining the prompts

Since the completion API accepts a list of strings as the prompts for which the LLM will generate responses, we assemble our list of prompts.

In [None]:
prompt_1 = "Respond to this question as if you were a computer scientist: What is the difference between interpreted and compiled programming languages?"
prompt_2 = "Respond to this question as if you were a data scientist: What is the difference between classification and regression?"
prompts = [prompt_1, prompt_2]

Please note that it is not recommended to provide more than 10 prompts in one call to the completion API.
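If you have more prompts than that, one option is to split them into batches and issue one call per batch. The sketch below is one way to do this; `chunk_prompts` and `complete_in_batches` are hypothetical helpers, and `complete_fn` is a stand-in for whatever function actually sends a batch (for example, a wrapper around `c3.ModelInference.completion`).

```python
def chunk_prompts(prompts, max_per_call=10):
    """Split a list of prompts into consecutive batches of at most max_per_call."""
    if max_per_call < 1:
        raise ValueError("max_per_call must be at least 1")
    return [prompts[i:i + max_per_call]
            for i in range(0, len(prompts), max_per_call)]


def complete_in_batches(prompts, complete_fn, max_per_call=10):
    """Call complete_fn once per batch and return the per-batch results in order."""
    return [complete_fn(batch) for batch in chunk_prompts(prompts, max_per_call)]


# Stand-in for a real request, so the sketch runs without a C3 connection.
# In the notebook this could instead be (assumption):
#   lambda batch: c3.ModelInference.completion(route=route, prompts=batch, params=params)
def fake_complete(batch):
    return f"sent {len(batch)} prompts"


many_prompts = [f"Question {i}" for i in range(23)]
for result in complete_in_batches(many_prompts, fake_complete):
    print(result)
```

Each batch result is kept separate here because this tutorial does not specify the shape of the completion API's return value.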

### Defining the completion parameters

For a given model, there will be an assortment of parameters for generating text using that model. For our completions using `opt-125m`, we will use three text generation parameters that are common across many LLMs: `max_tokens`, `temperature`, and `n`.

#### Setting `max_tokens`

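`max_tokens` specifies the maximum number of tokens that the LLM will generate for each response. You can think of tokens roughly as words, although a single word is often split into more than one token, so 128 tokens usually yields somewhat fewer than 128 words. In our case, we would like the generated text to have an approximate maximum length of 128 words.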

In [None]:
max_tokens = 128

#### Setting `temperature`

`temperature` specifies the randomness or creativity of the LLM's responses.

Technically speaking, an LLM does not emit a token directly. At each step it first produces a vector of scores, called the logit vector, with one entry per token in its vocabulary. These logits are converted into probabilities (typically with a softmax), and a single token is then sampled at random from that distribution. `temperature` lets you control the randomness/creativity of the responses by rescaling the logits before they are converted, which sharpens or flattens the distribution that tokens are sampled from.

A temperature of `0` makes the generated text almost deterministic, because sampling reduces to choosing the token with the highest probability. Higher temperatures give tokens with lower probabilities a greater chance of being sampled, thus generating more "creative" responses.

For our case, we will set the temperature to `0.5`, which should generate reasonable but different responses.

In [None]:
temperature = 0.5
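To make the effect of temperature concrete, here is a toy sketch of temperature-scaled sampling over three made-up scores. It illustrates the general technique (divide the scores by the temperature, apply a softmax, then sample) and is not a description of how the model behind any particular route implements sampling. The function names and the toy scores are assumptions for illustration.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by temperature (> 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng=random):
    """Pick a token index; temperature 0 is treated as greedy (argmax) selection."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for a three-token vocabulary
for t in (0.1, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At low temperatures nearly all of the probability mass sits on the highest-scoring token; as the temperature rises, the distribution flattens and the other tokens are sampled more often.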

#### Setting `n`

`n` controls the number of responses to generate per prompt. For example, if I submit two prompts for completion with `n` set to `3`, I will receive three responses for each of my prompts, giving me six responses in total. In our case, we will set `n` to `2`.

In [None]:
n = 2
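The arithmetic behind the total number of responses can be sketched as follows. Both helper names are hypothetical, and the enumeration only labels which (prompt, sample) pairs to expect; it makes no claim about how the completion API orders or structures its output.

```python
from itertools import product


def expected_response_count(num_prompts, n):
    """Total responses: n responses for each of num_prompts prompts."""
    return num_prompts * n


def response_slots(num_prompts, n):
    """List every (prompt_index, sample_index) pair that should be generated."""
    return list(product(range(num_prompts), range(n)))


print(expected_response_count(2, 3))  # two prompts with n=3 gives 6 responses
print(response_slots(2, 2))           # the four pairs for this tutorial's settings
```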

### Assembling the completion parameters

We can now assemble the completion parameters we will pass to the completion API. Remember, the `params` input is a map of string to any.

In [None]:
params = {
    'max_tokens' : max_tokens,
    'temperature' : temperature,
    'n' : n
}

### Generating responses

Finally, we can pass what we've defined to the completion API to generate responses.

In [None]:
c3.ModelInference.completion(route=route, prompts=prompts, params=params)