# Serving MistralLite on a Custom Text Generation Inference Container

This notebook walks step by step through customizing an inference container and deploying the open-source MistralLite model for natural language generation. The container is a modified version of Hugging Face's [Text Generation Inference Container](https://github.com/huggingface/text-generation-inference). We deploy the customized container for LLM inference, then invoke the deployed endpoint with example prompts.

## Start TGI Server

Execute the following cells to deploy the LLM for long contexts. The Docker container can take several minutes to initialize.

In [1]:
!mkdir -p models

> **Warning:** The first time you start the Docker container, it may take 10 minutes or more to become ready.
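
Because the container can take a while to come up, it may help to poll the server port before sending any prompts. The following is a minimal sketch using only the Python standard library. The helper name `wait_for_port` and its default timeout and interval are assumptions for illustration, not part of the original notebook. An open port only shows that something is listening; it does not guarantee that the model has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=900.0, interval_s=5.0):
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection could be made within timeout_s seconds.
    (Hypothetical helper; the default values are illustrative assumptions.)
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means a process is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: the server is not up yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example usage, with the host and port used later in this notebook:
# ready = wait_for_port("localhost", 443)
```

If the helper returns `False`, a later client call would be expected to fail with a connection error, so it is worth checking that the container is running before retrying.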

## Perform Inference

We can now invoke the model. First install the `text-generation` library, then define an invocation function that prompts the deployed model. Example prompts, including a long-context prompt, are provided to run.

In [4]:
!pip install text_generation==0.6.1

In [5]:
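from text_generation import Client

# Host and port where the TGI server is expected to be reachable.
SERVER_PORT = 443
SERVER_HOST = "localhost"
SERVER_URL = f"{SERVER_HOST}:{SERVER_PORT}"
tgi_client = Client(f"http://{SERVER_URL}", timeout=60)


def invoke_tgi(prompt,
               random_seed=1,
               max_new_tokens=400,
               print_stream=True,
               assist_role=True):
    """Stream a completion from the TGI endpoint and return the generated text.

    random_seed is only used if the `seed` argument below is uncommented.
    """
    if assist_role:
        # Wrap the raw prompt in the prompter/assistant template.
        prompt = f"<|prompter|>{prompt}</s><|assistant|>"
    output = ""
    for response in tgi_client.generate_stream(
        prompt,
        do_sample=False,  # greedy decoding
        max_new_tokens=max_new_tokens,
        return_full_text=False,
        # Optional generation parameters, disabled by default:
        # temperature=None,
        # truncate=None,
        # seed=random_seed,
        # typical_p=0.2,
    ):
        # Skip special tokens and accumulate the remaining text.
        if hasattr(response, "token") and not response.token.special:
            snippet = response.token.text
            output += snippet
            if print_stream:
                print(snippet, end="", flush=True)
    return output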
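
When `assist_role` is enabled, `invoke_tgi` wraps the prompt in a prompter/assistant template before sending it. The sketch below isolates that step so the exact string sent to the server can be inspected without a running endpoint. The helper name `build_prompt` is hypothetical and not part of the notebook; it only mirrors the f-string used inside `invoke_tgi`.

```python
def build_prompt(prompt, assist_role=True):
    """Return the prompt as invoke_tgi would send it (hypothetical helper)."""
    if assist_role:
        # Same template string as in invoke_tgi.
        return f"<|prompter|>{prompt}</s><|assistant|>"
    return prompt


print(build_prompt("What are the main challenges to support a long context for LLM?"))
# <|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>
```

Passing `assist_role=False` returns the prompt unchanged, which can be useful when the caller has already applied the template.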




In [6]:
prompt = "What are the main challenges to support a long context for LLM?"
result = invoke_tgi(prompt)

ConnectionError: HTTPConnectionPool(host='localhost', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000029779C07B10>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))