# Text-To-Text LLM Server

**Important: Select the venv Python interpreter before you start.**

This repository is designed to be used with Visual Studio Code and a Docker Dev Container.

![dev-container](../img/dev-container.png)

## 1. Setup

**Instructions:**

a) Download model

```bash
huggingface-cli download bartowski/Mistral-7B-Instruct-v0.3-exl2 \
    --revision 1a09a351a5fb5a356102bfca2d26507cdab11111 \
    --local-dir ~/.gai/models/exllamav2-mistral7b \
    --local-dir-use-symlinks False
```

or

```bash
huggingface-cli download bartowski/dolphin-2.9-llama3-8b-exl2 \
    --revision 6521bd8b0f793a038f85d445316c94cdd0957d8e \
    --local-dir ~/.gai/models/exllamav2-dolphin \
    --local-dir-use-symlinks False
```
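Before moving on, it can help to confirm that the download actually landed where the configuration expects it. The following is a minimal sketch using only the standard library. The helper name `missing_model_files` is hypothetical, and the expectation that an EXL2 download contains a `config.json` and at least one `.safetensors` file is an assumption, not something this repository states.

```python
from pathlib import Path


def missing_model_files(model_dir: str) -> list[str]:
    """Return a list of problems found in a downloaded model directory."""
    root = Path(model_dir).expanduser()
    if not root.is_dir():
        return [f"directory not found: {root}"]

    problems = []
    # Assumption: an EXL2 download ships a config.json ...
    if not (root / "config.json").is_file():
        problems.append("config.json")
    # ... and at least one .safetensors weight file.
    if not any(root.glob("*.safetensors")):
        problems.append("*.safetensors")
    return problems


if __name__ == "__main__":
    problems = missing_model_files("~/.gai/models/exllamav2-mistral7b")
    print(problems or "model directory looks complete")
```

An empty list means both checks passed; anything else names what is missing.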

b) Create `gai.yml` in `~/.gai`

```yaml
generators:
    ttt:
        default: "ttt-exllamav2-dolphin"
        configs:
            ttt-exllamav2-dolphin:
                type: "ttt"
                engine: "exllamav2"
                model: "dolphin"
                name: "ttt-exllamav2-dolphin"
                model_path: "models/exllamav2-dolphin"
                model_basename: "model"
                max_seq_len: 8192
                prompt_format: "mistral"
                hyperparameters:
                    temperature: 0.85
                    top_p: 0.8
                    top_k: 50
                    max_tokens: 1000
                    tool_choice: "auto"
                    max_retries: 5
                    stop: ["<|im_end|>", "</s>", "[/INST]"]
                extra:
                    no_flash_attn: true
                    seed: null
                    decode_special_tokens: false
                module:
                    name: "gai.ttt.server.gai_exllamav2"
                    class: "GaiExLlamav2"
            ttt-exllamav2-mistral7b:
                type: "ttt"
                engine: "exllamav2"
                model: "mistral7b"
                name: "ttt-exllamav2-mistral7b"
                model_path: "models/exllamav2-mistral7b"
                model_basename: "model"
                max_seq_len: 8192
                prompt_format: "mistral"
                hyperparameters:
                    temperature: 0.85
                    top_p: 0.8
                    top_k: 50
                    max_tokens: 1000
                    tool_choice: "auto"
                    max_retries: 5
                    stop: ["<|im_end|>", "</s>", "[/INST]"]
                extra:
                    no_flash_attn: true
                    seed: null
                    decode_special_tokens: false
                module:
                    name: "gai.ttt.server.gai_exllamav2"
                    class: "GaiExLlamav2"
```
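A typo in `default` or a missing key in one of the entries usually only surfaces when the server starts. The sketch below shows one way to check the parsed configuration up front. It works on a plain dictionary so it stays standard-library only; loading the file itself (for example with PyYAML's `yaml.safe_load`) is left out. The function name `find_config_problems` and the set of required keys are assumptions made for illustration, not rules enforced by the gai library.

```python
# Assumption: these keys are treated as mandatory for this check only.
REQUIRED_KEYS = {"type", "engine", "model", "name", "model_path", "module"}


def find_config_problems(cfg: dict) -> list[str]:
    """Return human-readable problems found in a parsed gai.yml dictionary."""
    ttt = cfg.get("generators", {}).get("ttt", {})
    configs = ttt.get("configs", {})
    problems = []

    # The default must point at one of the defined configs.
    default = ttt.get("default")
    if default not in configs:
        problems.append(f"default '{default}' is not defined under configs")

    for key, entry in configs.items():
        missing = sorted(REQUIRED_KEYS - set(entry))
        if missing:
            problems.append(f"{key}: missing {', '.join(missing)}")
        # In the example above, each entry's name matches its key.
        if entry.get("name") != key:
            problems.append(f"{key}: name is '{entry.get('name')}'")
    return problems


if __name__ == "__main__":
    sample = {
        "generators": {
            "ttt": {
                "default": "ttt-exllamav2-dolphin",
                "configs": {
                    "ttt-exllamav2-dolphin": {
                        "type": "ttt",
                        "engine": "exllamav2",
                        "model": "dolphin",
                        "name": "ttt-exllamav2-dolphin",
                        "model_path": "models/exllamav2-dolphin",
                        "module": {"name": "gai.ttt.server.gai_exllamav2", "class": "GaiExLlamav2"},
                    }
                },
            }
        }
    }
    print(find_config_problems(sample))
```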

---

## 2. Smoke Test

In [1]:
# check .gairc
import os
gairc=None
with open(os.path.expanduser("~/.gairc"),"r") as f:
    gairc = f.read()
print(gairc)

# check ~/.gairc (if docker created .gairc)
import json
jsoned=json.loads(gairc)
assert os.path.expanduser(jsoned["app_dir"])==os.path.expanduser("~/.gai")  # compare against the current user's home, not a hard-coded path

# check ~/.gai (if docker created the mount point)
assert os.path.exists(os.path.expanduser("~/.gai"))

# Initiate
from gai.lib.server.singleton_host import SingletonHost
from gai.lib.common.utils import free_mem
from rich.console import Console
console=Console()

from gai.ttt.server.config.ttt_config import TTTConfig
ttt_config = TTTConfig(
    type="ttt",
    engine="exllamav2",
    model="dolphin",
    name="ttt-exllamav2-dolphin",
    model_path="models/exllamav2-dolphin",
    max_seq_len=8192,
    prompt_format="mistral",
    hyperparameters={
        "temperature": 0.85,
        "top_p": 0.8,
        "top_k": 50,
        "max_tokens": 1000,
        "tool_choice": "auto",
        "max_retries": 5,
        "stop": ["<|im_end|>", "</s>", "[/INST]"],
    },
    extra={
        "no_flash_attn": True,
        "seed": None,
        "decode_special_tokens": False        
    },
    module={
        "name": "gai.ttt.server.gai_exllamav2",
        "class": "GaiExLlamav2"
    }
)

# before loading
free_mem()
try:
    with SingletonHost.GetInstanceFromConfig(ttt_config) as host:
        # after loading
        free_mem()
finally:
    # after disposal
    free_mem()

{"app_dir":"/home/kakkoii1337/.gai"}



---

## 3. Integration Test

### Startup

In [2]:
host = SingletonHost.GetInstanceFromConfig(ttt_config, verbose=False)
host.load()
generator = host.generator
free_mem()

1.6793403625488281

### a) Test streaming

In [3]:
response = host.generator.create(
    messages=[{"role":"user","content":"Tell me a one paragraph story"},
                {"role":"assistant","content":""}],
    stream=True)
for message in response:
    if message.choices[0].delta.content:
        print(message.choices[0].delta.content, end="", flush=True)
   

A young boy named Timmy was walking home from school one day when he found a shiny, silver coin on the ground. He picked it up and decided to keep it as a lucky charm. From that day on, he always carried the coin with him, believing it brought him good fortune. One day, while on a trip to the zoo with his class, he lost the coin. His friends and teachers searched high and low for it, but to no avail. Timmy was devastated, thinking he had lost his good luck charm forever. However, just as he was about to give up hope, he looked down and saw the coin shining up at him from a drain grate. It seemed that the coin had been waiting for him all along, and Timmy was once again filled with luck.
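When a test needs to assert on the streamed text rather than just print it, the chunks can be collected into one string. This is a small sketch that assumes each chunk has the `choices[0].delta.content` shape used in the cell above. The `fake_chunk` helper is a hypothetical stand-in so the example runs without a loaded model.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join the delta.content pieces of a streamed response into one string."""
    parts = []
    for chunk in chunks:
        content = chunk.choices[0].delta.content
        if content:  # skip None and empty deltas
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Hypothetical stand-in that mimics the shape of a streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


if __name__ == "__main__":
    stream = [fake_chunk("A young "), fake_chunk(None), fake_chunk("boy")]
    print(collect_stream(stream))
```

With a real host, the same function would be called as `collect_stream(host.generator.create(messages=..., stream=True))`.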

### b) Test generation

In [4]:

response = host.generator.create(
    messages=[{"role":"user","content":"Tell me a one paragraph story"},
                {"role":"assistant","content":""}],
    stream=False)
print(response.choices[0].message.content)


A man named John, who lived in a small village, had always dreamed of traveling the world. One day, he found a mysterious map in his grandfather's attic. The map led him on an adventure across oceans, deserts, and mountains, where he encountered many wonders and made lifelong friends. In the end, he realized that the greatest adventure was not exploring the world, but the journey of self-discovery.


### c) Test Tool Calling

In [5]:
messages = [
    {"role":"user","content":"What is the current time in Singapore?"},
    {"role":"assistant","content":""}
]
tool_choice="required"
tools = [
    {
        "type": "function",
        "function": {
            "name": "google",
            "description": "The 'google' function is a powerful tool that allows the AI to gather external information from the internet using Google search. It can be invoked when the AI needs to answer a question or provide information that requires up-to-date, comprehensive, and diverse sources which are not inherently known by the AI. For instance, it can be used to find current date, current news, weather updates, latest sports scores, trending topics, specific facts, or even the current date and time. The usage of this tool should be considered when the user's query implies or explicitly requests recent or wide-ranging data, or when the AI's inherent knowledge base may not have the required or most current information. The 'search_query' parameter should be a concise and accurate representation of the information needed.",
            "parameters": {
                "type": "object",
                "properties": {
                    "search_query": {
                        "type": "string",
                        "description": "The search query to search google with. For example, to find the current date or time, use 'current date' or 'current time' respectively."
                    }
                },
                "required": ["search_query"]
            }
        }
    }
]
response = host.generator.create(
    messages=messages,
    tools=tools,
    tool_choice=tool_choice,
    stream=False)
print(response)


ChatCompletion(id='chatcmpl-ff6f4915-d19b-4edc-ba92-44bd99d51cab', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content=None, refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_df148ead-9b47-45b0-9be7-0b653ff29752', function=Function(arguments='{"search_query": "current time in Singapore"}', name='google'), type='function')]))], created=1734441085, model='exllamav2-mistral7b', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=50, prompt_tokens=418, total_tokens=468, completion_tokens_details=None, prompt_tokens_details=None))
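The response above only names the tool and its arguments; running the tool is left to the caller. The following sketch shows one possible way to route `tool_calls` to local Python functions. The `dispatch_tool_calls` helper, the handler registry, and the `google` stub are all hypothetical, and the message object is a stand-in with the same attribute names as the output above.

```python
import json
from types import SimpleNamespace


def dispatch_tool_calls(message, handlers: dict) -> list[dict]:
    """Call the matching handler for each tool call and collect the results."""
    results = []
    for call in message.tool_calls or []:
        name = call.function.name
        if name not in handlers:
            raise KeyError(f"no handler registered for tool '{name}'")
        # The arguments arrive as a JSON-encoded string.
        kwargs = json.loads(call.function.arguments)
        results.append(
            {"tool_call_id": call.id, "name": name, "result": handlers[name](**kwargs)}
        )
    return results


def google(search_query: str) -> str:
    """Hypothetical stub; a real handler would perform the search."""
    return f"(search results for: {search_query})"


if __name__ == "__main__":
    # Stand-in for response.choices[0].message
    message = SimpleNamespace(
        tool_calls=[
            SimpleNamespace(
                id="call_example",
                function=SimpleNamespace(
                    name="google",
                    arguments='{"search_query": "current time in Singapore"}',
                ),
            )
        ]
    )
    print(dispatch_tool_calls(message, {"google": google}))
```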


### d) Test Structured Output

In [6]:
# Define Schema
from pydantic import BaseModel
class Book(BaseModel):
    title: str
    summary: str
    author: str
    published_year: int

text = """Foundation is a science fiction novel by American writer
Isaac Asimov. It is the first published in his Foundation Trilogy (later
expanded into the Foundation series). Foundation is a cycle of five
interrelated short stories, first published as a single book by Gnome Press
in 1951. Collectively they tell the early story of the Foundation,
an institute founded by psychohistorian Hari Seldon to preserve the best
of galactic civilization after the collapse of the Galactic Empire.
"""
response = host.generator.create(
    messages=[{'role':'user','content':text},{'role':'assistant','content':''}],
    json_schema=Book.model_json_schema(),  # Pydantic v2 replacement for the deprecated Book.schema()
    stream=False
)
print(response)


ChatCompletion(id='chatcmpl-4b7b0ba2-957b-4078-800e-8920b09adf72', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n   "title": "Foundation",\n   "summary": "Foundation is a science fiction novel by American writer Isaac Asimov. It is the first published in his Foundation Trilogy (later expanded into the Foundation series). Foundation is a cycle of five interrelated short stories, first published as a single book by Gnome Press in 1951.",\n   "author": "Isaac Asimov",\n   "published_year": 1951\n }', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1734441092, model='exllamav2-mistral7b', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=105, prompt_tokens=253, total_tokens=358, completion_tokens_details=None, prompt_tokens_details=None))
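The structured output comes back as a JSON string in `message.content`, so a test can parse it and check it against the schema. Below is a minimal standard-library sketch that handles only flat schemas like `Book` (required fields and primitive types). The function name `check_against_schema` is hypothetical. With Pydantic installed, `Book.model_validate_json(content)` would be the more complete route.

```python
import json

# Mapping from JSON Schema primitive types to Python types (flat schemas only).
JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def check_against_schema(content: str, schema: dict) -> dict:
    """Parse content as JSON and check required fields and primitive types."""
    data = json.loads(content)
    for field in schema.get("required", []):
        if field not in data:
            raise ValueError(f"missing required field: {field}")
    for field, spec in schema.get("properties", {}).items():
        expected = JSON_TYPES.get(spec.get("type"))
        if field in data and expected and not isinstance(data[field], expected):
            raise ValueError(f"field '{field}' is not of type {spec['type']}")
    return data


if __name__ == "__main__":
    book_schema = {
        "properties": {
            "title": {"title": "Title", "type": "string"},
            "summary": {"title": "Summary", "type": "string"},
            "author": {"title": "Author", "type": "string"},
            "published_year": {"title": "Published Year", "type": "integer"},
        },
        "required": ["title", "summary", "author", "published_year"],
        "title": "Book",
        "type": "object",
    }
    content = '{"title": "Foundation", "summary": "...", "author": "Isaac Asimov", "published_year": 1951}'
    print(check_against_schema(content, book_schema))
```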


### Teardown

In [7]:
del host.generator.model
del host.generator.cache
del host.generator.tokenizer
del host.generator
import gc,torch
gc.collect()
torch.cuda.empty_cache()
free_mem()

1.8934059143066406

---

## 4. API Test

**Instructions**:

a) Open the **Run and Debug** view and select **Python Debugger: gai-ttt server (dolphin)**

b) Press `F5` to start the API server.

c) Wait for the server to start.
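Instead of watching the terminal, the wait in step c) can be scripted. This is a minimal sketch that polls the server's root URL until it answers with HTTP 200. The helper names, the timeout, and the polling interval are assumptions; the URL `http://localhost:12031` is the one used by the tests in this document.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str) -> bool:
    """Return True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_for_server(url, timeout=120.0, interval=2.0, probe=http_probe, sleep=time.sleep) -> bool:
    """Poll until probe(url) succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    ready = wait_for_server("http://localhost:12031", timeout=10.0)
    print("server is up" if ready else "server did not respond in time")
```

The `probe` and `sleep` parameters exist so the loop can be exercised in a test without a running server.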


**Tests**:

Run the following cells to test the API.

### a) Test Generating

In [9]:
%%bash
curl -X POST \
    http://localhost:12031/gen/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -s \
    -N \
    -d "{\"model\":\"exllamav2-mistral7b\", \
        \"messages\": [ \
            {\"role\": \"user\",\"content\": \"Tell me a story.\"}, \
            {\"role\": \"assistant\",\"content\": \"\"} \
        ],\
        \"max_tokens\":100,\
        \"tool_choice\": \"none\"}"
        

{"id":"chatcmpl-99e5e1ab-4c3a-48d8-87c1-438ff237c242","choices":[{"finish_reason":"length","index":0,"logprobs":null,"message":{"content":"Once upon a time, in a land far, far away, there was a young girl named Sarah. She lived in a small village surrounded by a dense forest. The villagers lived in harmony with nature, but they were always anxious about the mysterious monsters that lurked in the forest.\n\n One day, Sarah decided to venture into the forest to find out if the stories about the monsters were true. As she walked deeper into the forest, she heard strange noises and felt a chill","refusal":null,"role":"assistant","audio":null,"function_call":null,"tool_calls":null}}],"created":1734441323,"model":"exllamav2-mistral7b","object":"chat.completion","service_tier":null,"system_fingerprint":null,"usage":{"completion_tokens":100,"prompt_tokens":31,"total_tokens":131,"completion_tokens_details":null,"prompt_tokens_details":null}}

### b) Test Streaming

In [10]:
import json
import httpx

json_payload = {
    "temperature": 0.2,
    "max_tokens": 50,
    "stream": True,  # boolean, so the server streams chunks
    "messages": [
        {
            "role": "user",
            "content": "Tell me a one paragraph story."
        },
        {
            "role": "assistant",
            "content": ""
        }
    ]
}
async def http_post_async(json_payload):

    # Send the POST request using httpx with streaming
    async with httpx.AsyncClient(timeout=30.0) as client:
        async with client.stream("POST", "http://localhost:12031/gen/v1/chat/completions", json=json_payload) as response:
            response.raise_for_status()
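            async for chunk in response.aiter_text():  # aiter_text() decodes bytes to str
                # Assumes each network chunk holds exactly one complete JSON object
                chunk = json.loads(chunk)
                content = chunk["choices"][0]["delta"].get("content")
                if content:  # skip empty or missing deltas
                    print(content, end="", flush=True)

# Top-level await works in a notebook cell; the coroutine prints as it streams and returns None
await http_post_async(json_payload)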


Once upon a time, in a small village nestled between towering mountains, lived a young girl named Mia. She was known for her radiant smile and her kind heart. One day, while wandering in the forest, she stumbled upon a
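The streaming cell parses every network chunk with `json.loads`, which only works while each chunk carries exactly one complete JSON object. If chunks are ever split or merged in transit, a small buffer can absorb that. The sketch below uses `json.JSONDecoder.raw_decode` for this. The class name `JsonStreamDecoder` is hypothetical, and it assumes the server sends bare JSON objects back to back, as the cell above implies.

```python
import json


class JsonStreamDecoder:
    """Buffer incoming text and yield every complete JSON object found so far."""

    def __init__(self):
        self._buffer = ""
        self._decoder = json.JSONDecoder()

    def feed(self, text: str) -> list:
        """Add text to the buffer and return the objects that are now complete."""
        self._buffer += text
        objects = []
        while True:
            # raw_decode does not accept leading whitespace.
            self._buffer = self._buffer.lstrip()
            if not self._buffer:
                break
            try:
                obj, end = self._decoder.raw_decode(self._buffer)
            except json.JSONDecodeError:
                break  # incomplete object, wait for more text
            objects.append(obj)
            self._buffer = self._buffer[end:]
        return objects


if __name__ == "__main__":
    decoder = JsonStreamDecoder()
    pieces = ['{"choices":[{"delta":{"content":"Once "}}]}{"cho', 'ices":[{"delta":{"content":"upon"}}]}']
    for piece in pieces:
        for obj in decoder.feed(piece):
            print(obj["choices"][0]["delta"].get("content"), end="")
    print()
```

Inside the `async for` loop, `json.loads(chunk)` would then become a loop over `decoder.feed(chunk)`.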

### c) Test Tool Calling

In [11]:
%%bash
curl -X POST \
    http://localhost:12031/gen/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -s \
    -N \
    -d "{\"model\":\"exllamav2-mistral7b\", \
        \"messages\": [ \
            {\"role\": \"user\",\"content\": \"What is the current time in Singapore\"}, \
            {\"role\": \"assistant\",\"content\": \"\"} \
        ],\
        \"tools\": [\
            {\
                \"type\": \"function\",\
                \"function\": {\
                    \"name\": \"google\",\
                    \"description\": \"The 'google' function is a powerful tool that allows the AI to gather external information from the internet using Google search. It can be invoked when the AI needs to answer a question or provide information that requires up-to-date, comprehensive, and diverse sources which are not inherently known by the AI. For instance, it can be used to find current date, current news, weather updates, latest sports scores, trending topics, specific facts, or even the current date and time. The usage of this tool should be considered when the user's query implies or explicitly requests recent or wide-ranging data, or when the AI's inherent knowledge base may not have the required or most current information. The 'search_query' parameter should be a concise and accurate representation of the information needed.\",\
                    \"parameters\": {\
                        \"type\": \"object\",\
                        \"properties\": {\
                            \"search_query\": {\
                                \"type\": \"string\",\
                                \"description\": \"The search query to search google with. For example, to find the current date or time, use 'current date' or 'current time' respectively.\"\
                            }\
                        },\
                        \"required\": [\"search_query\"]\
                    }\
                }\
            }\
        ],\
        \"tool_choice\": \"required\"}"

{"id":"chatcmpl-2c9fea01-2e20-455d-9bc0-b7eec67e8926","choices":[{"finish_reason":"tool_calls","index":0,"logprobs":null,"message":{"content":null,"refusal":null,"role":"assistant","audio":null,"function_call":null,"tool_calls":[{"id":"call_9182d5e9-212e-49e9-9a69-6e662a3038e1","function":{"arguments":"{\"search_query\": \"current time Singapore\"}","name":"google"},"type":"function"}]}}],"created":1734441335,"model":"exllamav2-mistral7b","object":"chat.completion","service_tier":null,"system_fingerprint":null,"usage":{"completion_tokens":48,"prompt_tokens":432,"total_tokens":480,"completion_tokens_details":null,"prompt_tokens_details":null}}

### d) Test JSON Schema

In [1]:
%%bash
curl -X POST \
    http://localhost:12031/gen/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -s \
    -N \
    -d "{\"model\":\"exllamav2-mistral7b\", \
        \"messages\": [ \
            {\"role\": \"user\",\"content\": \"Foundation is a science fiction novel by American writer \
            Isaac Asimov. It is the first published in his Foundation Trilogy (later \
            expanded into the Foundation series). Foundation is a cycle of five \
            interrelated short stories, first published as a single book by Gnome Press \
            in 1951. Collectively they tell the early story of the Foundation, \
            an institute founded by psychohistorian Hari Seldon to preserve the best \
            of galactic civilization after the collapse of the Galactic Empire.\"}, \
            {\"role\": \"assistant\",\"content\": \"\"} \
        ],\
        \"json_schema\": {\"properties\": \
            {\"title\": \
                {\"title\": \"Title\", \"type\": \"string\"}, \
                    \"summary\": {\"title\": \"Summary\", \"type\": \"string\"}, \
                    \"author\": {\"title\": \"Author\", \
                    \"type\": \"string\"\
                }, \
                \"published_year\": {\
                    \"title\": \"Published Year\", \
                    \"type\": \"integer\"}}, \
                \"required\": [\
                    \"title\", \
                    \"summary\", \
                    \"author\", \
                    \"published_year\"\
                ], \
                \"title\": \"Book\", \
                \"type\": \"object\"\
            },\
        \"stream\": false }"

{"id":"chatcmpl-3171cb79-cc8a-4d01-8861-71997faa55af","choices":[{"finish_reason":"stop","index":0,"logprobs":null,"message":{"content":"{\n   \"title\": \"Foundation\",\n   \"summary\": \"Foundation is a science fiction novel by American writer Isaac Asimov. It is the first published in his Foundation Trilogy (later expanded into the Foundation series). Foundation is a cycle of five interrelated short stories, first published as a single book by Gnome Press in 1951. Collectively they tell the early story of the Foundation, an institute founded by psychohistorian Hari Seldon to preserve the best of galactic civilization after the collapse of the Galactic Empire.\",\n   \"author\": \"Isaac Asimov\",\n   \"published_year\": 1951\n }","refusal":null,"role":"assistant","audio":null,"function_call":null,"tool_calls":null}}],"created":1734441828,"model":"exllamav2-mistral7b","object":"chat.completion","service_tier":null,"system_fingerprint":null,"usage":{"completion_tokens":148,"prompt_toke

### e) Shut down the API Service

---

## 5. Docker

Run this test on the host, **NOT** inside the dev container.

**Instructions:** 

- Press **CTRL+SHIFT+P** > **Tasks: Run Task** > **docker-compose: up**

**Tests:**

Repeat the API tests from section 4.

**Tear Down:**

- Press **CTRL+SHIFT+P** > **Tasks: Run Task** > **docker-compose: down**

### Smoke Test

In [1]:
%%bash
curl http://localhost:12031


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    35  100    35    0     0    964      0 --:--:-- --:--:-- --:--:--  1060


{"message":"gai-ttt-svr-exllamav2"}

**Tests:**

Repeat the API tests from section 4.

**Tear Down:**

- Press **CTRL+SHIFT+P** > **Tasks: Run Task** > **Docker: stop**

### Debugging

a) Start the container with `python -m debugpy --listen 0.0.0.0:5678 main.py`.

b) Make sure port 5678 is open.

c) Click "Debug" in the toolbar.

d) Select "Attach" > "Run and Debug".

e) Add a breakpoint in the code.

f) Run the API test and check that it triggers the breakpoint.
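If attaching fails, a quick first check is whether the debug port is reachable at all. This is a minimal sketch using only the standard library; the helper name `is_port_open` is hypothetical, and `localhost` assumes the container publishes port 5678 to the host.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("debug port reachable:", is_port_open("localhost", 5678))
```

A `False` result points at the container start command or the port mapping rather than at the debugger configuration.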