# Creating a Large Language Model Inference Service

Welcome to the fourth part of the tutorial series on building a question-answering application over a corpus of private
...

## Table of Contents

1. [Architecture](#architecture)
1. [Creating the Inference Service](#creating-the-inference-service)
1. [Conclusion and Next Steps](#conclusion-and-next-steps)

In [None]:
import os
import subprocess
import requests
import ipywidgets as widgets
from IPython.display import display

# Architecture

In this setup, an additional component, called a "transformer", plays a pivotal role in processing user queries and
integrating the Vector Store ISVC with the LLM ISVC. ...

Here's a detailed look at the process:

1. **Intercepting the User's Request**: The transformer acts as a gateway between the user and the LLM ISVC. When a user
   sends a query, it first reaches the transformer. The transformer extracts the query from the request.
1. **Communicating with the Vector Store ISVC**: The transformer then takes the user's query and sends a POST request to the
   Vector Store ISVC including the user's query in the payload, just like you did in the previous Notebook.
1. **Receiving and Processing the Context**: The Vector Store ISVC responds by sending back the relevant context.
1. **Combining the Context with the User's Query**: The transformer then combines the received context with the user's
   original query using a prompt template. This creates an enriched prompt that contains both the user's original
   question and the relevant context from our documents.
1. **Forwarding the Enriched Query to the LLM Predictor**: Finally, the transformer forwards this enriched query to the LLM
   predictor. The predictor then processes this query and generates a response, which is sent back to the transformer.
   Steps 2 through 5 are transparent to the user.
1. **Final Response**: The transformer returns the response to the user.
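
The flow above can be sketched in a few lines of Python. This is only an illustration, not the code shipped in `dockerfiles/transformer`: the prompt template, the `query` payload key, and the names `build_prompt`, `handle_request`, `fetch_context`, and `call_predictor` are all assumptions made for this example. The two network calls are passed in as plain functions so the sketch runs without a cluster.

```python
from typing import Callable

# Hypothetical prompt template; the template used by the real transformer may differ.
PROMPT_TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(context: str, question: str) -> str:
    """Combine the retrieved context with the user's question (step 4)."""
    return PROMPT_TEMPLATE.format(context=context.strip(), question=question.strip())


def handle_request(
    request: dict,
    fetch_context: Callable[[str], str],
    call_predictor: Callable[[str], str],
) -> dict:
    """Walk through steps 1 to 6 using injected I/O functions."""
    question = request["query"]                # step 1: extract the query (assumed key)
    context = fetch_context(question)          # steps 2 and 3: ask the Vector Store ISVC
    prompt = build_prompt(context, question)   # step 4: enrich the query
    answer = call_predictor(prompt)            # step 5: forward to the LLM predictor
    return {"answer": answer}                  # step 6: return the response


# Usage with stand-in functions instead of real HTTP calls
response = handle_request(
    {"query": "What is an ISVC?"},
    fetch_context=lambda q: "ISVC is short for Inference Service.",
    call_predictor=lambda prompt: f"(model output for a {len(prompt)}-character prompt)",
)
print(response["answer"])
```

Injecting `fetch_context` and `call_predictor` keeps the prompt-building logic separate from the transport, so each piece can be tested on its own.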

As such, you should build one custom Docker image at this point, for the transformer component. The
source code and the Dockerfile are provided in the corresponding folder: `dockerfiles/transformer`.
For your convenience, you can use the image we have pre-built for you: `harbor.ezml.local/dev/tritonserver:24.05-vllm-with-s3`.

Once ready, proceed with the next steps.

# Creating the Inference Service

As before, you need to provide the name of the transformer image. You can leave the image field empty to use the image
we provide for you:

In [None]:
# Add heading
heading = widgets.HTML("<h2>Model Inference Service</h2>")
display(heading)

domain_input = widgets.Text(description='Domain:', placeholder="ezua1.local")
username_input = widgets.Text(description='Username:')
password_input = widgets.Password(description='Password:')
name_input = widgets.Text(description='Inference Service Name:', placeholder="vllm-triton")
container_image_input = widgets.Text(description='Container Image:', placeholder="harbor.ezml.local/dev/tritonserver:24.05-vllm-with-s3")
mlflow_run_id_input = widgets.Text(description='Mlflow Run ID:', placeholder="d3682c0c64834c398ff5f0f0754cd255")
limit_cpu_input = widgets.Text(description='Limit CPU:', placeholder="4")
limit_memory_input = widgets.Text(description='Limit Memory:', placeholder="4Gi")
limit_gpu_input = widgets.Text(description='Limit GPU:', placeholder="1")
submit_button = widgets.Button(description='Submit')
success_message = widgets.Output()

domain = None
mlflow_username = None
mlflow_password = None

def submit_button_clicked(b):    
    os.environ["DOMAIN"] = domain_input.value
    os.environ["USERNAME"] = username_input.value
    os.environ["PASSWORD"] = password_input.value
    os.environ["NAME"]= name_input.value
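    # Read the image from its own field; fall back to the pre-built image (the placeholder) when left empty
    os.environ["CONTAINER_IMAGE"] = container_image_input.value or container_image_input.placeholder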
    os.environ["MLFLOW_RUN_ID"] = mlflow_run_id_input.value
    os.environ["LIMIT_CPU"] = limit_cpu_input.value
    os.environ["LIMIT_MEMORY"] = limit_memory_input.value
    os.environ["LIMIT_GPU"] = limit_gpu_input.value
    with success_message:
        success_message.clear_output()
        print("Credentials submitted successfully!")
    submit_button.disabled = True

submit_button.on_click(submit_button_clicked)

# Set margin on the submit button
submit_button.layout.margin = '20px 0 20px 0'

# Display inputs and button
display(domain_input, username_input, password_input, name_input,  container_image_input, mlflow_run_id_input, limit_cpu_input, limit_memory_input, limit_gpu_input, submit_button, success_message)
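
Because the form values are interpolated directly into the manifest, it can help to check them before generating the YAML. The following is a minimal sketch under a few assumptions: the helper names `find_missing` and `looks_like_memory_quantity` are hypothetical, `CONTAINER_IMAGE` is left out of the required list because the text allows that field to stay empty, and the memory check is a simplified pattern that covers only common forms such as `4Gi` or `512Mi`, not the full Kubernetes quantity grammar.

```python
import re

# Fields that must be non-empty before the manifest is rendered (assumed list)
REQUIRED_FIELDS = [
    "DOMAIN", "USERNAME", "PASSWORD", "NAME",
    "MLFLOW_RUN_ID", "LIMIT_CPU", "LIMIT_MEMORY", "LIMIT_GPU",
]

# Simplified: digits followed by an optional binary or decimal suffix
MEMORY_PATTERN = re.compile(r"[0-9]+(Ki|Mi|Gi|Ti|Pi|Ei|k|M|G|T|P|E)?")


def find_missing(env, required=REQUIRED_FIELDS):
    """Return the names in `required` that are absent or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


def looks_like_memory_quantity(value):
    """Rough format check for a memory limit such as '4Gi'."""
    return MEMORY_PATTERN.fullmatch(value) is not None


# Usage with a plain dict; in the Notebook you could pass os.environ instead
example_env = {"DOMAIN": "ezua1.local", "NAME": "vllm-triton", "LIMIT_MEMORY": "4Gi"}
print(find_missing(example_env))
print(looks_like_memory_quantity(example_env["LIMIT_MEMORY"]))
```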

Define and apply the LLM Inference Service:

In [None]:
yaml_content = f"""
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "{os.environ["NAME"]}"
spec:
  predictor:
    triton:
      image: "{os.environ["CONTAINER_IMAGE"]}"
      env:
      - name: DOMAIN
        value: "{os.environ["DOMAIN"]}"
      - name: USERNAME
        value: "{os.environ["USERNAME"]}"
      - name: PASSWORD
        value: "{os.environ["PASSWORD"]}"
      - name: MLFLOW_RUN_ID
        value: "{os.environ["MLFLOW_RUN_ID"]}"
      resources:
        limits:
          cpu: {os.environ["LIMIT_CPU"]}
          memory: "{os.environ["LIMIT_MEMORY"]}"
          nvidia.com/gpu: {os.environ["LIMIT_GPU"]}
        requests:
          cpu: 1
          memory: "2Gi"
      storageUri: pvc://kubeflow-shared-pvc/model_repository
"""

with open("vllm-triton.yaml", "w") as f:
    f.write(yaml_content)

In [None]:
# check=True raises CalledProcessError if kubectl exits with a non-zero status
subprocess.run(["kubectl", "apply", "-f", "vllm-triton.yaml"], check=True)
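
Applying the manifest only submits it to the cluster; the service may take a while to start. A possible way to check on it is to read the `Ready` condition from the Inference Service status, as sketched below. The helper names `get_inference_service` and `is_ready` are hypothetical, and the sketch assumes `kubectl` is configured for the namespace where you applied the manifest.

```python
import json
import subprocess


def is_ready(isvc: dict) -> bool:
    """Return True if the status conditions contain Ready=True."""
    conditions = isvc.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


def get_inference_service(name: str) -> dict:
    """Fetch the InferenceService object as a dict via kubectl."""
    result = subprocess.run(
        ["kubectl", "get", "inferenceservice", name, "-o", "json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


# In the Notebook (requires cluster access):
#   is_ready(get_inference_service(os.environ["NAME"]))

# Offline example with a hand-written status object
sample = {"status": {"conditions": [{"type": "Ready", "status": "True"}]}}
print(is_ready(sample))
```

If the check keeps returning `False`, `kubectl describe inferenceservice` on the same name usually shows which condition is blocking.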

# Conclusion and Next Steps

Congratulations on completing this crucial step in this tutorial series! You've successfully built an LLM ISVC, and
you've learned about the role of a transformer in enriching user queries with relevant context from our documents.
Together with the Vector Store ISVC, these components form the backbone of your question-answering application.

However, the journey doesn't stop here. The next and final step is to test the LLM ISVC, ensuring that it's working as
expected and delivering accurate responses. This will help you gain confidence in your setup and prepare you for
real-world applications. In the next Notebook, you will invoke the LLM ISVC and see how to construct suitable requests,
communicate with the service, and interpret the responses.