## 1. Install

Create a conda environment named `gai-ttt-svr` and install the `gai-ttt-svr` package into it. Then switch the notebook kernel to `gai-ttt-svr` before continuing.


In [None]:
%%bash
conda create -n gai-ttt-svr python=3.10.10 -y
eval "$(conda shell.bash hook)" && conda activate gai-ttt-svr
cd ../..
poetry install

## 2. Smoke Test

In [2]:
from gai.lib.server.singleton_host import SingletonHost
from gai.lib.common.utils import free_mem
from rich.console import Console
console=Console()

config = {
    "type": "ttt",
    "generator_name": "exllamav2-mistral7b",
    "engine": "gai.ttt.server.GaiExLlamaV2",
    "model_path": "models/exllamav2-mistral7b",
    "model_basename": "model",
    "max_seq_len": 8192,
    "prompt_format": "mistral",
    "hyperparameters": {
        "temperature": 0.85,
        "top_p": 0.8,
        "top_k": 50,
        "max_new_tokens": 1000,
    },
    "tool_choice": "auto",
    "max_retries": 5,
    "stop_conditions": ["<s>", "</s>", "user:","\n\n"],
    "no_flash_attn":True,
    "seed": None,
    "decode_special_tokens": False,
    "module_name": "gai.ttt.server.gai_exllamav2",
    "class_name": "GaiExLlamav2",
    "init_args": [],
    "init_kwargs": {}
}

# before loading
free_mem()
try:
    with SingletonHost.GetInstanceFromConfig(config) as host:
        # after loading
        free_mem()
finally:
    # after disposal (errors still propagate; no re-raise needed)
    free_mem()

## 3. Completion

### Startup

In [3]:
host = SingletonHost.GetInstanceFromConfig(config, verbose=False)
host.load()
generator = host.generator
free_mem()

1.3845062255859375

### a) Test Streaming

In [4]:
response = host.generator.create(
    messages=[{"role":"user","content":"Tell me a one paragraph story"},
                {"role":"assistant","content":""}],
    stream=True)
for message in response:
    if message.choices[0].delta.content:
        print(message.choices[0].delta.content, end="", flush=True)
   

 Once upon a time, in a small village nestled between the mountains, lived a kind-hearted young woman named Maya. She was known for her exceptional skill in weaving intricate designs on cloth. One day, a mysterious stranger arrived in the village, seeking her help to weave a magical cloth that could heal the sick. Despite the challenges and threats from those who wished to claim the cloth for their own gain, Maya completed the cloth with courage and perseverance. In the end, the cloth healed the sick, and the stranger, who was actually a guardian angel, rewarded Maya by granting her a wish. Maya wished for the villagers' happiness and prosperity, and from that day forward, the village flourished like never before.
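When the full text is needed after streaming, rather than only printed, the deltas can be accumulated. The following is a minimal sketch: `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks only mimic the `choices[0].delta.content` shape used in the cell above, so it runs without a loaded model.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Concatenate the non-empty delta.content pieces of a chunk iterator."""
    parts = []
    for chunk in chunks:
        piece = chunk.choices[0].delta.content
        if piece:  # skip None and empty strings, as the print loop does
            parts.append(piece)
    return "".join(parts)


def fake_chunk(content):
    # Hypothetical stand-in for a streamed chunk; only the fields read above.
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [fake_chunk(""), fake_chunk(" Once"), fake_chunk(" upon"), fake_chunk(None)]
full_text = collect_stream(fake_stream)
print(repr(full_text))
```

With a live host, passing the `response` iterator from the cell above in place of `fake_stream` should work the same way, assuming the chunks keep that shape.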

### b) Test Generation

In [5]:

response = host.generator.create(
    messages=[{"role":"user","content":"Tell me a one paragraph story"},
                {"role":"assistant","content":""}],
    stream=False)
print(response.choices[0].message.content)


 Once upon a time, in a small village nestled between the mountains, lived a kind-hearted and hardworking farmer named Tomas. Despite the harsh conditions, he worked tirelessly to tend to his crops and livestock. One day, a severe storm swept through the valley, destroying Tomas's home and crops. Devastated but not defeated, Tomas rallied the village together, and they rebuilt his home and replanted his fields. The villagers' unwavering support and Tomas's resilient spirit proved that even in the face of adversity, community and determination can help overcome the toughest challenges.


### c) Test Tool Calling

In [6]:
messages = [
    {"role":"user","content":"What is the current time in Singapore?"},
    {"role":"assistant","content":""}
]
tool_choice="required"
tools = [
    {
        "type": "function",
        "function": {
            "name": "google",
            "description": "The 'google' function is a powerful tool that allows the AI to gather external information from the internet using Google search. It can be invoked when the AI needs to answer a question or provide information that requires up-to-date, comprehensive, and diverse sources which are not inherently known by the AI. For instance, it can be used to find current date, current news, weather updates, latest sports scores, trending topics, specific facts, or even the current date and time. The usage of this tool should be considered when the user's query implies or explicitly requests recent or wide-ranging data, or when the AI's inherent knowledge base may not have the required or most current information. The 'search_query' parameter should be a concise and accurate representation of the information needed.",
            "parameters": {
                "type": "object",
                "properties": {
                    "search_query": {
                        "type": "string",
                        "description": "The search query to search google with. For example, to find the current date or time, use 'current date' or 'current time' respectively."
                    }
                },
                "required": ["search_query"]
            }
        }
    }
]
response = host.generator.create(
    messages=messages,
    tools=tools,
    tool_choice=tool_choice,
    stream=False)
print(response)


ChatCompletion(id='chatcmpl-109a250e-feb0-430a-a39b-946a109cebe9', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content=None, refusal=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_c80da37d-8a6a-48f5-a6e1-2752915e2a12', function=Function(arguments='{"search_query": "current time Singapore"}', name='google'), type='function')]))], created=1723846895, model='exllamav2-mistral7b', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=21, prompt_tokens=336, total_tokens=357))
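The response only names the tool and its arguments; running the tool is left to the caller. A possible way to dispatch the call is sketched below. The `google` function here is a hypothetical placeholder, `TOOL_REGISTRY` and `run_tool_calls` are illustrative names, and the `role: tool` result shape follows the OpenAI chat convention; whether this server accepts such messages is not confirmed by this notebook.

```python
import json
from types import SimpleNamespace


def google(search_query):
    # Hypothetical placeholder; a real implementation would call a search API.
    return f"(search results for: {search_query})"


# Maps tool names (as declared in `tools`) to local callables.
TOOL_REGISTRY = {"google": google}


def run_tool_calls(message):
    """Run every tool call on an assistant message and return tool messages."""
    results = []
    for call in message.tool_calls or []:
        handler = TOOL_REGISTRY[call.function.name]
        kwargs = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": handler(**kwargs),
        })
    return results


# Stand-in mirroring only the fields printed in the output above.
fake_message = SimpleNamespace(
    tool_calls=[SimpleNamespace(
        id="call_c80da37d-8a6a-48f5-a6e1-2752915e2a12",
        function=SimpleNamespace(
            name="google",
            arguments='{"search_query": "current time Singapore"}',
        ),
    )]
)
tool_messages = run_tool_calls(fake_message)
print(tool_messages)
```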


### d) Test Structured Output

In [7]:
# Define Schema
from pydantic import BaseModel
class Book(BaseModel):
    title: str
    summary: str
    author: str
    published_year: int

text = """Foundation is a science fiction novel by American writer
Isaac Asimov. It is the first published in his Foundation Trilogy (later
expanded into the Foundation series). Foundation is a cycle of five
interrelated short stories, first published as a single book by Gnome Press
in 1951. Collectively they tell the early story of the Foundation,
an institute founded by psychohistorian Hari Seldon to preserve the best
of galactic civilization after the collapse of the Galactic Empire.
"""
response = host.generator.create(messages=[{'role':'user','content':text},{'role':'assistant','content':''}], 
    json_schema=Book.schema(),
    stream=False
    )
print(response)


ChatCompletion(id='chatcmpl-7966bf89-fef5-40c9-add3-2460fb6383cc', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=' {\n  "title": "Foundation",\n  "summary": "Foundation is a science fiction novel by Isaac Asimov, the first published in his Foundation Trilogy. It is a cycle of five interrelated short stories that tell the early story of the Foundation, an institute founded by psychohistorian Hari Seldon to preserve the best of galactic civilization after the collapse of the Galactic Empire.",\n  "author": "Isaac Asimov",\n  "published_year": 1951\n}', refusal=None, role='assistant', function_call=None, tool_calls=None))], created=1723846902, model='exllamav2-mistral7b', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=116, prompt_tokens=257, total_tokens=373))
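The structured output arrives as a JSON string in `message.content`, so it still has to be parsed and checked before use. The sketch below does this with the standard library only; `parse_book` and `REQUIRED_FIELDS` are hypothetical names, and `sample_content` is a shortened copy of the content shown above. Validating with the `Book` model itself would be a natural alternative.

```python
import json

# Field names and types taken from the Book schema defined in the cell above.
REQUIRED_FIELDS = {"title": str, "summary": str, "author": str, "published_year": int}


def parse_book(content):
    """Parse message content as JSON and check the Book fields and types."""
    data = json.loads(content)  # json.loads accepts the leading space seen above
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], expected_type):
            raise ValueError(f"{name} should be {expected_type.__name__}")
    return data


# Shortened sample in the same shape as the response content above.
sample_content = (
    ' {\n  "title": "Foundation",\n'
    '  "summary": "A cycle of five interrelated short stories.",\n'
    '  "author": "Isaac Asimov",\n'
    '  "published_year": 1951\n}'
)
book = parse_book(sample_content)
print(book["title"], book["published_year"])
```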


### Teardown

In [10]:
import gc
import torch

# Drop references to the model, cache, and tokenizer so their memory can be reclaimed
del host.generator.model
del host.generator.cache
del host.generator.tokenizer
del host.generator
gc.collect()
torch.cuda.empty_cache()
free_mem()

5.8794403076171875

---

## 4. API Test

**Instructions**:

a) Press `F5` (in Visual Studio Code) to start the API server.

**Tests**:

Run the following cells to test the API.

### a) Test Generation

In [12]:
%%bash
curl -X POST \
    http://localhost:12031/gen/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -s \
    -N \
    -d "{\"model\":\"exllamav2-mistral7b\", \
        \"messages\": [ \
            {\"role\": \"user\",\"content\": \"Tell me a story.\"}, \
            {\"role\": \"assistant\",\"content\": \"\"} \
        ],\
        \"tool_choice\": \"none\"}"

{"id":"chatcmpl-e842a513-c0f3-4858-a9ef-c4b63ae60ad1","choices":[{"finish_reason":"stop","index":0,"logprobs":null,"message":{"content":" Once upon a time, in a land far, far away, there was a magical kingdom named Eldoria. The kingdom was known for its beautiful landscapes, friendly inhabitants, and enchanted spells","refusal":null,"role":"assistant","function_call":null,"tool_calls":null}}],"created":1723847508,"model":"exllamav2-mistral7b","object":"chat.completion","service_tier":null,"system_fingerprint":null,"usage":{"completion_tokens":44,"prompt_tokens":14,"total_tokens":58}}
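The same request can be built from Python, which avoids escaping the JSON body by hand. This is a minimal sketch using only the standard library; `build_chat_request` and `send` are hypothetical helpers, and the URL and model name are copied from the curl call above. Only the request construction runs here; `send` needs the API server to be up.

```python
import json
import urllib.request

API_URL = "http://localhost:12031/gen/v1/chat/completions"


def build_chat_request(user_content, **extra):
    """Build (but do not send) a POST request equivalent to the curl call."""
    payload = {
        "model": "exllamav2-mistral7b",
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": ""},
        ],
        **extra,  # e.g. tool_choice="none", stream=True, tools=[...]
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    # Requires the API server to be running; not called in this sketch.
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_chat_request("Tell me a story.", tool_choice="none")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Extra keyword arguments are merged into the body, so the tool-calling and JSON-schema requests later in this section could be expressed by passing `tools=...` or `json_schema=...` as plain Python dictionaries.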

### b) Test Streaming

In [13]:
%%bash
curl -X POST \
    http://localhost:12031/gen/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -s \
    -N \
    -d "{\"model\":\"exllamav2-mistral7b\", \
        \"messages\": [ \
            {\"role\": \"user\",\"content\": \"Tell me a story.\"}, \
            {\"role\": \"assistant\",\"content\": \"\"} \
        ],\
        \"stream\":true}"


{"id": "chatcmpl-bd5fcebb-d87b-4fb3-a2f2-0c468680c653", "choices": [{"delta": {"content": "", "function_call": null, "refusal": null, "role": "assistant", "tool_calls": null}, "finish_reason": null, "index": 0, "logprobs": null}], "created": 1723847517, "model": "exllamav2-mistral7b", "object": "chat.completion.chunk", "service_tier": null, "system_fingerprint": null, "usage": null}
{"id": "chatcmpl-7af82ed0-b11d-4f56-929c-d4694ed64e18", "choices": [{"delta": {"content": " Once", "function_call": null, "refusal": null, "role": null, "tool_calls": null}, "finish_reason": null, "index": 0, "logprobs": null}], "created": 1723847518, "model": "exllamav2-mistral7b", "object": "chat.completion.chunk", "service_tier": null, "system_fingerprint": null, "usage": null}
{"id": "chatcmpl-0e6ac458-4ccb-4fdc-a36e-d4b6101919ef", "choices": [{"delta": {"content": " upon", "function_call": null, "refusal": null, "role": null, "tool_calls": null}, "finish_reason": null, "index": 0, "logprobs": null}], "
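The streamed output above appears to be one JSON chunk per line. Under that assumption, the text can be reassembled by parsing each line and joining the `delta.content` values, as in the sketch below. `iter_stream_content` is a hypothetical helper and `sample_lines` is a trimmed-down imitation of the chunks shown above; if the server framed chunks differently (for example with a `data:` prefix), the parsing would need adjusting.

```python
import json


def iter_stream_content(lines):
    """Yield the delta content from each JSON line of a streamed response."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between chunks
        chunk = json.loads(line)
        content = chunk["choices"][0]["delta"].get("content")
        if content:  # the first chunk carries an empty string
            yield content


# Trimmed-down sample lines in the same shape as the output above.
sample_lines = [
    '{"choices": [{"delta": {"content": "", "role": "assistant"}, "index": 0}]}',
    '{"choices": [{"delta": {"content": " Once", "role": null}, "index": 0}]}',
    "",
    '{"choices": [{"delta": {"content": " upon", "role": null}, "index": 0}]}',
]
story = "".join(iter_stream_content(sample_lines))
print(repr(story))
```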

### c) Test Tool Calling

In [14]:
%%bash
curl -X POST \
    http://localhost:12031/gen/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -s \
    -N \
    -d "{\"model\":\"exllamav2-mistral7b\", \
        \"messages\": [ \
            {\"role\": \"user\",\"content\": \"What is the current time in Singapore\"}, \
            {\"role\": \"assistant\",\"content\": \"\"} \
        ],\
        \"tools\": [\
            {\
                \"type\": \"function\",\
                \"function\": {\
                    \"name\": \"google\",\
                    \"description\": \"The 'google' function is a powerful tool that allows the AI to gather external information from the internet using Google search. It can be invoked when the AI needs to answer a question or provide information that requires up-to-date, comprehensive, and diverse sources which are not inherently known by the AI. For instance, it can be used to find current date, current news, weather updates, latest sports scores, trending topics, specific facts, or even the current date and time. The usage of this tool should be considered when the user's query implies or explicitly requests recent or wide-ranging data, or when the AI's inherent knowledge base may not have the required or most current information. The 'search_query' parameter should be a concise and accurate representation of the information needed.\",\
                    \"parameters\": {\
                        \"type\": \"object\",\
                        \"properties\": {\
                            \"search_query\": {\
                                \"type\": \"string\",\
                                \"description\": \"The search query to search google with. For example, to find the current date or time, use 'current date' or 'current time' respectively.\"\
                            }\
                        },\
                        \"required\": [\"search_query\"]\
                    }\
                }\
            }\
        ],\
        \"tool_choice\": \"required\"}"

{"id":"chatcmpl-32da141c-f092-40f4-98ae-8758b0e41794","choices":[{"finish_reason":"tool_calls","index":0,"logprobs":null,"message":{"content":null,"refusal":null,"role":"assistant","function_call":null,"tool_calls":[{"id":"call_9c91aa88-2bef-40b3-aae0-2c5ddd7d13cb","function":{"arguments":"{\"search_query\": \"current time Singapore\"}","name":"google"},"type":"function"}]}}],"created":1723847533,"model":"exllamav2-mistral7b","object":"chat.completion","service_tier":null,"system_fingerprint":null,"usage":{"completion_tokens":21,"prompt_tokens":335,"total_tokens":356}}

### d) Test JSON Schema

In [15]:
%%bash
curl -X POST \
    http://localhost:12031/gen/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -s \
    -N \
    -d "{\"model\":\"exllamav2-mistral7b\", \
        \"messages\": [ \
            {\"role\": \"user\",\"content\": \"Foundation is a science fiction novel by American writer \
            Isaac Asimov. It is the first published in his Foundation Trilogy (later \
            expanded into the Foundation series). Foundation is a cycle of five \
            interrelated short stories, first published as a single book by Gnome Press \
            in 1951. Collectively they tell the early story of the Foundation, \
            an institute founded by psychohistorian Hari Seldon to preserve the best \
            of galactic civilization after the collapse of the Galactic Empire.\"}, \
            {\"role\": \"assistant\",\"content\": \"\"} \
        ],\
        \"json_schema\": {\"properties\": \
            {\"title\": \
                {\"title\": \"Title\", \"type\": \"string\"}, \
                    \"summary\": {\"title\": \"Summary\", \"type\": \"string\"}, \
                    \"author\": {\"title\": \"Author\", \
                    \"type\": \"string\"\
                }, \
                \"published_year\": {\
                    \"title\": \"Published Year\", \
                    \"type\": \"integer\"}}, \
                \"required\": [\
                    \"title\", \
                    \"summary\", \
                    \"author\", \
                    \"published_year\"\
                ], \
                \"title\": \"Book\", \
                \"type\": \"object\"\
            },\
        \"tool_choice\": \"none\"}"

{"id":"chatcmpl-257c7efd-79f2-49d4-850a-efb9ac036d2d","choices":[{"finish_reason":"stop","index":0,"logprobs":null,"message":{"content":" {\n  \"title\": \"Foundation\",\n  \"summary\": \"Foundation is a science fiction novel by Isaac Asimov, the first published in his Foundation Trilogy. It is a cycle of five interrelated short stories that tell the early story of the Foundation, an institute founded by psychohistorian Hari Seldon to preserve the best of galactic civilization after the collapse of the Galactic Empire.\",\n  \"author\": \"Isaac Asimov\",\n  \"published_year\": 1951\n}","refusal":null,"role":"assistant","function_call":null,"tool_calls":null}}],"created":1723847541,"model":"exllamav2-mistral7b","object":"chat.completion","service_tier":null,"system_fingerprint":null,"usage":{"completion_tokens":116,"prompt_tokens":253,"total_tokens":369}}

### e) Shut down the API Service

## 5. Docker

a) Open Visual Studio Code.

b) Press `CTRL + SHIFT + P`, select `Tasks: Run Task`, then run the `build` task followed by the `run` task.

c) Check the docker logs to confirm the model is ready.

d) Repeat all the tests above.