# Notebook 1: LLM Streaming Client
This notebook demonstrates how to stream responses from the LLM. 

### NeMo Microservice Inference Server
The LLM has been deployed to [NVIDIA NeMo Microservice Inference Server](https://registry.ngc.nvidia.com/orgs/ohlfw0olaadg/teams/ea-participants/containers/nemollm-inference-ms) and leverages NVIDIA TensorRT-LLM (TRT-LLM), so it's optimized for low-latency, high-throughput inference.

This notebook communicates with the **NeMo Microservice Inference** server hosting the LLM over its REST API.

### Streaming LLM Responses
TRT-LLM on its own can drastically reduce LLM response latency, but streaming can take the user experience to the next level. Instead of waiting for the LLM to return an entire response, chunks of it can be processed as soon as they are available. This reduces the latency perceived by the user.
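
To make the effect on perceived latency concrete, here is a minimal sketch that compares the time until the first output in a blocking and a streaming loop. `fake_llm_stream` is a hypothetical stand-in for the model (it simply sleeps between tokens), so the numbers illustrate the idea only and say nothing about the real server.

```python
import time


def fake_llm_stream(tokens, delay=0.01):
    """Hypothetical stand-in for an LLM: yields one token at a time after a short delay."""
    for token in tokens:
        time.sleep(delay)
        yield token


tokens = ["The", " cheetah", " is", " the", " fastest", " land", " animal", "."]

# Blocking: nothing can be shown until every token has been generated.
start = time.perf_counter()
full_response = "".join(fake_llm_stream(tokens))
blocking_first_output = time.perf_counter() - start

# Streaming: the first token can be shown as soon as it arrives.
start = time.perf_counter()
streaming_first_output = None
streamed = []
for token in fake_llm_stream(tokens):
    if streaming_first_output is None:
        streaming_first_output = time.perf_counter() - start
    streamed.append(token)

print(f"blocking:  first output after {blocking_first_output:.3f}s")
print(f"streaming: first output after {streaming_first_output:.3f}s")
```

Both loops take about the same total time; only the wait before the first visible output differs.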

### Step 1: Structure the Query in a Prompt Template

A [**prompt template**](https://gpt-index.readthedocs.io/en/stable/api_reference/prompts.html) is a common paradigm in LLM development.

A prompt template is a predefined set of instructions that is provided to the LLM and guides the output produced by the model. It can contain few-shot examples and guidance, and it is a quick way to engineer the responses from the LLM. Llama 2 accepts the [prompt format](https://huggingface.co/blog/llama2#how-to-prompt-llama-2) shown in `PROMPT_TEMPLATE`, which we modify to be constructed with:
- The system prompt
- The context
- The user's question

In [None]:
PROMPT_TEMPLATE = (
 "<s>[INST] <<SYS>>"
 "{system_prompt}"
 "<</SYS>>"
 "[/INST] {context} </s><s>[INST] {question} [/INST]"
)
# For the Nemotron model, uncomment the prompt below. Prompts are model dependent, and responses vary depending on the prompt.
# PROMPT_TEMPLATE = (
#     "<extra_id_0>System"
#     "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Please ensure that your responses are positive in nature."
#     "<extra_id_0>System"
#     "{context} \n {question} Given context followed by query, you try to answer the query truthfully"
#     "<extra_id_1>Assistant"
# )
system_prompt = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Please ensure that your responses are positive in nature."
context = ""
question = "What is the fastest land animal?"
prompt = PROMPT_TEMPLATE.format(system_prompt=system_prompt, context=context, question=question)
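
The cell above leaves `context` empty. As a rough sketch of how the three slots combine when context is present, the snippet below repeats the same template so that it is self-contained and fills it through `build_prompt`, a hypothetical helper that is not part of the notebook. The shortened system prompt and the sample context are made up for illustration.

```python
# Same template as in the notebook cell, repeated so this sketch runs on its own.
PROMPT_TEMPLATE = (
    "<s>[INST] <<SYS>>"
    "{system_prompt}"
    "<</SYS>>"
    "[/INST] {context} </s><s>[INST] {question} [/INST]"
)


def build_prompt(question, context="", system_prompt="You are a helpful assistant."):
    """Hypothetical helper: fill the template's three slots and return the prompt string."""
    return PROMPT_TEMPLATE.format(
        system_prompt=system_prompt, context=context, question=question
    )


example_prompt = build_prompt(
    question="How fast can it run?",
    context="The cheetah is the fastest land animal.",
)
print(example_prompt)
```

Printing the rendered prompt before sending it is a cheap way to check that the special tokens and the three parts appear in the order the model expects.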

### Step 2: Define the Request Parameters

Additional inputs to the LLM can be modified:
- [frequency_penalty](https://platform.openai.com/docs/guides/text-generation/parameter-details): reduce the likelihood of sampling repetitive sequences of tokens
- n: [1] -- number of alternative text completions or choices to generate
- model: model name used for inference
- max_tokens: the maximum number of tokens (words/sub-words) generated
- stream: enable streaming
- temperature: [0, 1] -- higher values produce more diverse outputs
- stop: specifies a list of stop tokens that signal the end of a response
- [top_p](https://docs.cohere.com/docs/controlling-generation-with-top-k-top-p): [0, 1] -- cumulative probability cutoff for token selection; lower values mean sampling from a smaller nucleus sample (reduces variety)
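
To build intuition for how `temperature` and `top_p` interact, here is a toy sketch in pure Python. It is not how the inference server implements sampling; the vocabulary, the logits, and the helper names `apply_temperature` and `top_p_filter` are all made up for illustration.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (temperature must be > 0)."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p,
    then renormalize the kept probabilities."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {index: p / total for index, p in kept}


vocab = ["cheetah", "lion", "horse", "snail"]  # toy vocabulary
logits = [3.0, 1.5, 1.0, -2.0]  # made-up scores

results = {}
for temperature in (0.5, 1.0):
    probs = apply_temperature(logits, temperature)
    results[temperature] = top_p_filter(probs, top_p=0.9)
    print(temperature, {vocab[i]: round(p, 3) for i, p in results[temperature].items()})
```

With these made-up logits, the lower temperature concentrates probability on the top token, so fewer candidates survive the same `top_p` cutoff than at the higher temperature.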

In [None]:
# If you've changed `MODEL_NAME` in compose.env, set the "model" value in pload to the same name
pload = {
    "prompt": prompt,
    "frequency_penalty": 0,
    "n": 1,
    "model": "Llama-2-13b-chat-hf",
    "max_tokens": 300,
    "stream": True,
    "temperature": 1.0,
    "stop": ["</s>", "<extra_id_1>"],
}

### Step 3: Generate a Response from the NeMo Microservice Inference Server
The NeMo Microservice Inference Server exposes a REST API with a schema similar to OpenAI's. To generate a response, send a request to the NeMo Microservice Inference Server URL and read back the generated text.



<div class="alert alert-block alert-warning">
<b>WARNING!</b> Be sure to replace `server_url` with the address and port that the inference server is running on.
</div>

Use the address and port that the inference server is available on; for example, `localhost:9999`.

**If you are running this notebook as part of the AI workflow, you don't have to replace the URL.**

In [None]:
import json
import time

import requests

tokens_generated = 0
start_time = time.time()

server_url = "http://llm:9999/v1/completions"
response = requests.post(server_url, json=pload, stream=True)

current_string = ""
if response.status_code == 200:
    for line in response.iter_lines():
        line = line.decode("utf-8")
        if not line:
            continue
        # Every chunk is prefixed with "data: "; remove that prefix to parse the JSON.
        # Slicing is used because str.lstrip strips a set of characters, not a prefix.
        if line.startswith("data: "):
            line = line[len("data: "):]
        try:
            # Extract the text from the response
            chunk = json.loads(line).get("choices", [{}])[0].get("text", "")
        except (json.JSONDecodeError, AttributeError, IndexError):
            # Non-JSON data such as [DONE] marks the end of the stream
            continue
        tokens_generated += 1

        # Unlike OpenAI, the NeMo Microservice Inference Server returns the complete response
        # so far instead of only the newly generated token, so print just the new part.
        print(chunk[len(current_string):], end="", flush=True)
        current_string = chunk
else:
    print(f"Request failed with status code {response.status_code}")

total_time = time.time() - start_time
print(f"\n--- Generated {tokens_generated} tokens in {total_time:.2f} seconds ---")
print(f"--- {tokens_generated / total_time:.2f} tokens/sec ---")
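
The parsing logic in the cell above can be tried without a running server. The following sketch feeds a few made-up lines through the same steps: strip the `data: ` prefix, parse the JSON, and print only the newly added text. The sample lines assume, as the comment in the cell states, that each event carries the complete response so far; `extract_text` is a hypothetical helper name.

```python
import json


def extract_text(line):
    """Return the "text" field of one "data: {...}" line, or None for non-JSON lines such as [DONE]."""
    if line.startswith("data: "):
        line = line[len("data: "):]
    try:
        return json.loads(line)["choices"][0]["text"]
    except (json.JSONDecodeError, KeyError, IndexError, TypeError):
        return None


# Made-up lines imitating a cumulative stream: each event repeats the text generated so far.
sample_lines = [
    'data: {"choices": [{"text": "The"}]}',
    'data: {"choices": [{"text": "The cheetah"}]}',
    'data: {"choices": [{"text": "The cheetah."}]}',
    "data: [DONE]",
]

seen = ""
deltas = []
for line in sample_lines:
    text = extract_text(line)
    if text is None:
        continue  # end-of-stream marker or malformed line
    deltas.append(text[len(seen):])  # keep only what is new since the last event
    seen = text

print(deltas)
```

If a server instead sends only the newly generated token in each event, the slicing step is unnecessary and each `text` value can be printed directly.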

### Step 4: Use the NeMo Microservice Inference LangChain Wrapper to Stream Output
We establish a connection to the NeMo Microservice Inference server running the TRT-LLM Llama-2 model. The connection uses the `NemoInfer` class, a LangChain LLM wrapper from the `nemo_infer` module.

* `server_url`: The URL of the NeMo Microservice Inference server. Change `server_url` to wherever the inference microservice is running. If you're running this notebook as part of the generative AI workflow, you don't have to replace the URL.

* `model`: The name of the model to use, which in this case is "Llama-2-13b-chat-hf".

* `callbacks`: A list of callbacks to be used during inference. The `streaming_stdout.StreamingStdOutCallbackHandler()` callback is used to print the streaming response to the console.

* `tokens`: The maximum number of tokens to generate.

The `llm` object represents the established connection to the NeMo Microservice Inference server and can be used to generate text using the Llama-2-13b-chat-hf model.
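
To illustrate what a streaming callback does, here is a minimal sketch of a custom handler that collects tokens instead of printing them. It relies on LangChain's `BaseCallbackHandler.on_llm_new_token` hook; `TokenCollector` is a hypothetical class name, the tokens are fed in by hand rather than by a real LLM, and the `object` fallback exists only so the sketch runs where LangChain is not installed.

```python
try:
    from langchain.callbacks.base import BaseCallbackHandler
except ImportError:
    # Fallback (assumption for this sketch only): lets the example run without LangChain.
    BaseCallbackHandler = object


class TokenCollector(BaseCallbackHandler):
    """Hypothetical handler that stores each streamed token in a list."""

    def __init__(self):
        self.tokens = []

    def on_llm_new_token(self, token, **kwargs):
        # LangChain calls this hook once for every token a streaming LLM emits.
        self.tokens.append(token)

    def text(self):
        return "".join(self.tokens)


collector = TokenCollector()

# Simulate a stream by calling the hook directly with made-up tokens.
for token in ["Jensen", " Huang"]:
    collector.on_llm_new_token(token)

print(collector.text())
```

A handler like this could be passed in the `callbacks` list alongside or instead of the stdout handler, assuming `NemoInfer` forwards tokens to callbacks the way standard LangChain LLMs do.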



In [None]:
from nemo_infer import NemoInfer
from langchain.callbacks import streaming_stdout

callbacks = [streaming_stdout.StreamingStdOutCallbackHandler()]
# Connect to the TRT-LLM Llama-2 model running on the NeMo Microservice Inference server at the URL below
llm = NemoInfer(server_url="http://llm:9999/v1/completions", model="Llama-2-13b-chat-hf", callbacks=callbacks, tokens=500)

In [None]:
llm("Who is the CEO of NVIDIA?")