# Invoking and Testing the Large Language Model Inference Service

Welcome to the fifth and final part of the tutorial series on building a question-answering application over a corpus of
private documents using Large Language Models (LLMs). In the previous Notebooks, you created vector embeddings of your
documents, deployed a Vector Store Inference Service (ISVC), created an LLM ISVC, and enriched user queries with
relevant context using an ISVC Transformer component.

In this Notebook, you invoke and test the LLM ISVC you've created. This step lets you validate the functionality and
performance of your service in a practical setting before you build on top of it.

Throughout this Notebook, you see how to construct and send requests to the LLM ISVC, interpret the responses, and
handle potential issues that might arise. By the end of this Notebook, you will have gained practical experience in
working with the LLM ISVC, preparing you to integrate it into larger systems or applications.

## Table of Contents

1. [Invoking the LLM Inference Service](#invoking-the-llm-inference-service)
1. [Conclusion](#conclusion)

In [None]:
import requests
import ipywidgets as widgets

from IPython.display import display

## Invoking the LLM Inference Service

You are now ready to test your service. Provide your question and get back the answer from the LLM inference service.

In [None]:
# Add heading
heading = widgets.HTML("<h2>Credentials</h2>")
display(heading)

domain_input = widgets.Text(description='Domain:', placeholder="i001ua.tryezmeral.com")
username_input = widgets.Text(description='Username:')
password_input = widgets.Password(description='Password:')
submit_button = widgets.Button(description='Submit')
success_message = widgets.Output()

def submit_button_clicked(b):
    global domain, username, password
    domain = domain_input.value
    username = username_input.value
    password = password_input.value
    with success_message:
        success_message.clear_output()
        print("Credentials submitted successfully!")
    submit_button.disabled = True

submit_button.on_click(submit_button_clicked)

# Set margin on the submit button
submit_button.layout.margin = '20px 0 20px 0'

# Display inputs and button
display(domain_input, username_input, password_input, submit_button, success_message)

In [None]:
token_url = f"https://keycloak.{domain}/realms/UA/protocol/openid-connect/token"

data = {
    "username" : username,
    "password" : password,
    "grant_type" : "password",
    "client_id" : "ua-grant",
}

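# verify=False skips TLS certificate verification; use it only with self-signed test clusters
token_response = requests.post(token_url, data=data, allow_redirects=True, verify=False)
token_response.raise_for_status()  # fail early on bad credentials or a wrong domain

token = token_response.json()["access_token"]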
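
The token request above relies on the reply containing an `access_token` field. The following is a minimal sketch of a more defensive way to read that reply, assuming that failed requests return a JSON body with the usual OAuth 2.0 `error` and `error_description` fields. The helper name `extract_access_token` is hypothetical and not part of any library.

```python
def extract_access_token(status_code, payload):
    """Return the access token from a token-endpoint reply, or raise a readable error."""
    if status_code != 200:
        # Assumption: error replies carry "error" and/or "error_description" fields.
        detail = payload.get("error_description") or payload.get("error") or "unknown error"
        raise RuntimeError(f"Token request failed with HTTP {status_code}: {detail}")
    if "access_token" not in payload:
        raise RuntimeError("Token reply did not contain an 'access_token' field")
    return payload["access_token"]
```

To use it, pass the status code and the parsed JSON body of the response returned by `requests.post` in the cell above.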

In [None]:
DOMAIN_NAME = "svc.cluster.local"  # change this to your domain for external access
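# Read the namespace this Notebook runs in; strip whitespace so the hostname stays valid
with open("/var/run/secrets/kubernetes.io/serviceaccount/namespace", "r") as f:
    NAMESPACE = f.read().strip()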
DEPLOYMENT_NAME = "llm"
MODEL_NAME = DEPLOYMENT_NAME
SVC = f'{DEPLOYMENT_NAME}-transformer.{NAMESPACE}.{DOMAIN_NAME}'
URL = f"https://{SVC}/v1/models/{MODEL_NAME}:predict"

print(URL)
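
The URL above follows the pattern `https://<deployment>-transformer.<namespace>.<domain>/v1/models/<model>:predict`. As a sketch, the same construction can be wrapped in a small function so that you can switch between in-cluster and external domains; `build_predict_url` is a hypothetical helper that mirrors the cell above and is not part of the tutorial's code.

```python
def build_predict_url(deployment_name, namespace, domain_name="svc.cluster.local", model_name=None):
    """Build the predict URL of the transformer component, as in the cell above."""
    # By default the model name matches the deployment name, as in this Notebook.
    model_name = model_name or deployment_name
    host = f"{deployment_name}-transformer.{namespace.strip()}.{domain_name}"
    return f"https://{host}/v1/models/{model_name}:predict"


print(build_predict_url("llm", "my-namespace"))
```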

In [None]:
data = {
  "instances": [{
      "system": "You are an AI assistant. You will be given a task. You must generate a detailed answer.",
      "instruction": "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.",
      "input": "Can a process running in a Linux Namespace communicate with another process running in another Namespace?",
      "max_tokens": 50,
      "top_k": 100,
      "top_p": 0.4,
      "num_docs": 1,
      "temperature": 0.2
  }]
}

headers = {"Authorization": f"Bearer {token}"}

response = requests.post(URL, json=data, headers=headers, verify=False)

In [None]:
response.text
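
`response.text` shows the raw reply. The sketch below is one way to interpret it, assuming the service follows the KServe V1 inference protocol, in which a successful reply is a JSON object with a `predictions` list. The shape of each prediction depends on your custom predictor, so the helper only returns the list as-is. The name `parse_predictions` is hypothetical.

```python
import json


def parse_predictions(status_code, body_text):
    """Return the 'predictions' list from a V1-style reply, or raise a readable error."""
    if status_code != 200:
        # Include the start of the body, which often carries the server's error message.
        raise RuntimeError(f"Inference request failed with HTTP {status_code}: {body_text[:200]}")
    try:
        payload = json.loads(body_text)
    except json.JSONDecodeError as exc:
        raise RuntimeError("Reply was not valid JSON") from exc
    if not isinstance(payload, dict):
        raise RuntimeError("Reply was valid JSON but not a JSON object")
    # Assumption: the answers live under the "predictions" key.
    return payload.get("predictions", [])
```

For example, `parse_predictions(response.status_code, response.text)` returns the list of predictions for the request sent above.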

If you run this tutorial in an environment without access to a GPU device, the inference step can take longer than
usual; allow approximately 5 minutes for the response. If you encounter a time-out error, send the request again.
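
Retrying after a time-out can be automated. The following is a minimal sketch of a generic retry wrapper; `call_with_retries` and its parameters are hypothetical names chosen for illustration, and the attempt count and wait time are arbitrary defaults rather than tuned values.

```python
import time


def call_with_retries(send, attempts=3, wait_seconds=5.0, retry_on=(TimeoutError,)):
    """Call send() and retry when it raises one of the exception types in retry_on."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except retry_on as exc:
            last_error = exc
            # Wait before the next attempt, but not after the final one.
            if attempt < attempts:
                time.sleep(wait_seconds)
    raise last_error
```

With `requests`, you could pass `retry_on=(requests.exceptions.Timeout,)` and
`send=lambda: requests.post(URL, json=data, headers=headers, timeout=300, verify=False)`, where the 300-second
`timeout` reflects the 5-minute estimate above.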

Furthermore, increasing the number of retrieved documents (`num_docs`) increases the latency, because every retrieved
document is added to the context the model has to process. For testing purposes, keep this number as low as possible.
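
To experiment with different `num_docs` values without retyping the whole request, you can build the payload in a function. This is only a sketch: `build_payload` is a hypothetical helper, and the prompt texts and sampling values are copied from the request cell above.

```python
def build_payload(question, num_docs=1, max_tokens=50):
    """Build a request body for one question, using the same fields as the cell above."""
    return {
        "instances": [{
            "system": "You are an AI assistant. You will be given a task. You must generate a detailed answer.",
            "instruction": (
                "Use the following pieces of context to answer the question at the end. "
                "If you don't know the answer, just say that you don't know, "
                "don't try to make up an answer."
            ),
            "input": question,
            "max_tokens": max_tokens,
            "top_k": 100,
            "top_p": 0.4,
            "num_docs": num_docs,  # number of documents retrieved from the Vector Store
            "temperature": 0.2,
        }]
    }
```

You can then send `build_payload("your question", num_docs=2)` as the `json` argument of `requests.post` and compare the response times.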

## Conclusion

Congratulations on reaching the finish line of this comprehensive tutorial! You've successfully developed an application
capable of delivering responses to user queries in a natural language format. The journey has not only enhanced your
understanding but also allowed you to acquire hands-on experience in various facets of LLMs.

Throughout this process, you've demystified the concept of a Vector Store, created custom predictor and transformer
components, and learned to log artifacts with MLflow. Moreover, all these tasks have been accomplished within the
comfortable and familiar confines of your JupyterLab environment.

In conclusion, you've taken significant strides toward mastering LLMs and building real-world applications with the
EzUA platform. As a next step, you can follow the instructions in the README file and deploy the front-end application
of this service.