# Text-To-Text LLM Server

**Important: Select the venv Python interpreter before you start.**

This repository is designed to be used with Visual Studio Code and a Docker Dev Container.

![dev-container](../img/dev-container.png)

## 1. Setup

**Instructions:**

a) Download model

```bash
huggingface-cli download bartowski/dolphin-2.9.3-mistral-7B-32k-GGUF \
    dolphin-2.9.3-mistral-7B-32k-Q4_K_M.gguf \
    --revision 740ce4567b3392bd065637d2ac29127ca417cc45 \
    --local-dir ~/.gai/models/llamacpp-dolphin \
    --local-dir-use-symlinks False
```

or

```bash
huggingface-cli download bartowski/Mistral-7B-Instruct-v0.3-GGUF \
    Mistral-7B-Instruct-v0.3-Q4_K_M.gguf \
    --revision 61fd4167fff3ab01ee1cfe0da183fa27a944db48 \
    --local-dir ~/.gai/models/llamacpp-mistral7b \
    --local-dir-use-symlinks False
```
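After the download, it can help to confirm that the file really is a GGUF model before pointing the server at it. The following is a minimal sketch, not part of this repository: the helper name `is_gguf` is hypothetical, and it only checks that the file exists and starts with the four-byte `GGUF` magic that GGUF files begin with.

```python
import os
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # GGUF files start with these four bytes


def is_gguf(path) -> bool:
    """Return True if `path` exists and starts with the GGUF magic bytes."""
    p = Path(os.path.expanduser(str(path)))
    if not p.is_file():
        return False
    with p.open("rb") as f:
        return f.read(4) == GGUF_MAGIC


# Path taken from the download command above
model = "~/.gai/models/llamacpp-dolphin/dolphin-2.9.3-mistral-7B-32k-Q4_K_M.gguf"
print(model, "->", "ok" if is_gguf(model) else "missing or not a GGUF file")
```

This does not verify the full download; an interrupted transfer can still leave a file with a valid header.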

b) Create `gai.yml` in `~/.gai`

```yaml
generators:
    ttt:
        default: "ttt-llamacpp-dolphin"
        configs:
            ttt-llamacpp-dolphin:
                type: "ttt"
                engine: "llamacpp"
                model: "dolphin"
                name: "ttt-llamacpp-dolphin"
                model_filepath: "models/llamacpp-dolphin/dolphin-2.9.3-mistral-7B-32k-Q4_K_M.gguf"
                max_seq_len: 4096
                prompt_format: "mistral"
                hyperparameters:
                    temperature: 0.85
                    top_p: 0.8
                    top_k: 50
                    max_tokens: 1000
                    tool_choice: "auto"
                    max_retries: 5
                    stop: ["<|im_end|>", "</s>", "[/INST]"]
                module:
                    name: "gai.ttt.server.gai_llamacpp"
                    class: "GaiLlamaCpp"
            ttt-llamacpp-mistral7b:
                type: "ttt"
                engine: "llamacpp"
                model: "mistral7b"
                name: "ttt-llamacpp-mistral7b"
                model_filepath: "models/llamacpp-mistral7b/Mistral-7B-Instruct-v0.3-Q4_K_M.gguf"
                max_seq_len: 4096
                prompt_format: "mistral"
                hyperparameters:
                    temperature: 0.85
                    top_p: 0.8
                    top_k: 50
                    max_tokens: 1000
                    tool_choice: "auto"
                    max_retries: 5
                    stop: ["<|im_end|>", "</s>", "[/INST]"]
                module:
                    name: "gai.ttt.server.gai_llamacpp"
                    class: "GaiLlamaCpp"

```
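A quick consistency check on the configuration can catch typos such as a `default` that does not match any entry under `configs`. The sketch below is an illustration only: `check_ttt_config` and `REQUIRED_KEYS` are hypothetical names, the list of required keys is an assumption taken from the entries shown above, and the input is assumed to be the dictionary obtained by parsing `gai.yml` with a YAML parser of your choice.

```python
# Keys that every entry in the YAML above carries (assumed to be required)
REQUIRED_KEYS = {"type", "engine", "model", "name", "model_filepath", "module"}


def check_ttt_config(config: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    ttt = config.get("generators", {}).get("ttt", {})
    configs = ttt.get("configs", {})
    default = ttt.get("default")
    if default not in configs:
        problems.append(f"default {default!r} is not defined under configs")
    for key, entry in configs.items():
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            problems.append(f"{key}: missing keys {sorted(missing)}")
        if entry.get("name") != key:
            problems.append(f"{key}: name is {entry.get('name')!r}, expected {key!r}")
    return problems


# A cut-down version of the configuration above, as a parsed dictionary
example = {
    "generators": {
        "ttt": {
            "default": "ttt-llamacpp-dolphin",
            "configs": {
                "ttt-llamacpp-dolphin": {
                    "type": "ttt",
                    "engine": "llamacpp",
                    "model": "dolphin",
                    "name": "ttt-llamacpp-dolphin",
                    "model_filepath": "models/llamacpp-dolphin/dolphin-2.9.3-mistral-7B-32k-Q4_K_M.gguf",
                    "module": {"name": "gai.ttt.server.gai_llamacpp", "class": "GaiLlamaCpp"},
                }
            },
        }
    }
}
print(check_ttt_config(example))
```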

---
## 2. Smoke Test

```python
# Check that ~/.gairc exists and print its contents
import json
import os

with open(os.path.expanduser("~/.gairc"), "r") as f:
    gairc = f.read()
print(gairc)

# Check that app_dir in ~/.gairc points to ~/.gai (i.e. Docker created .gairc)
jsoned = json.loads(gairc)
assert os.path.expanduser(jsoned["app_dir"]) == os.path.expanduser("~/.gai")

# Check that ~/.gai exists (i.e. Docker created the mount point)
assert os.path.exists(os.path.expanduser("~/.gai"))

# Initialize
from gai.lib.server.singleton_host import SingletonHost
from gai.lib.common.utils import free_mem
from rich.console import Console
console = Console()

from gai.ttt.server.config.ttt_config import TTTConfig
ttt_config = TTTConfig(
    type="ttt",
    engine="llamacpp",
    model="dolphin",
    name="ttt-llamacpp-dolphin",
    model_filepath="models/llamacpp-dolphin/dolphin-2.9.3-mistral-7B-32k-Q4_K_M.gguf",
    max_seq_len=4096,
    prompt_format="mistral",
    hyperparameters={
        "temperature": 0.85,
        "top_p": 0.8,
        "top_k": 50,
        "max_tokens": 1000,
        "tool_choice": "auto",
        "max_retries": 5,
        "stop": ["<|im_end|>", "</s>", "[/INST]"],
    },
    module={
        "name": "gai.ttt.server.gai_llamacpp",
        "class": "GaiLlamaCpp"
    }
)

# Memory before loading
free_mem()
try:
    with SingletonHost.GetInstanceFromConfig(ttt_config) as host:
        # Memory after loading
        free_mem()
finally:
    # Memory after disposal
    free_mem()
```

{"app_dir":"/home/kakkoii1337/.gai"}
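The `.gairc` check can also be wrapped in small helpers that return the application directory and build a model path from it. This is a minimal sketch under stated assumptions: `load_app_dir` and `resolve_model_path` are hypothetical names, and joining `app_dir` with the relative `model_filepath` from `gai.yml` is an assumption about how the path is meant to be read, not confirmed behavior of the server.

```python
import json
import os


def load_app_dir(gairc_path="~/.gairc") -> str:
    """Read the JSON file at `gairc_path` and return its expanded app_dir."""
    with open(os.path.expanduser(gairc_path), "r") as f:
        gairc = json.load(f)
    return os.path.expanduser(gairc["app_dir"])


def resolve_model_path(app_dir: str, model_filepath: str) -> str:
    """Join app_dir and a relative model_filepath (assumed resolution rule)."""
    return os.path.join(app_dir, model_filepath)
```

For example, `resolve_model_path(load_app_dir(), "models/llamacpp-dolphin/dolphin-2.9.3-mistral-7B-32k-Q4_K_M.gguf")` gives the location that `os.path.exists` can then check.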



---
## 3. Integration Test

### Startup

```python
from gai.lib.server.singleton_host import SingletonHost
host = SingletonHost.GetInstanceFromConfig(ttt_config, verbose=False)
host.load()
generator = host.generator
free_mem()
```

5.2718048095703125

### a) Test Streaming

```python
response = host.generator.create(
    messages=[{"role":"user","content":"Tell me a one paragraph story"},
                {"role":"assistant","content":""}],
    stream=True)
for chunk in response:
    # Skip empty chunks and chunks whose delta carries no text
    if chunk and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```


Once upon a time, in a small village nestled between a dense forest and a towering mountain, there lived an old woman known for her wisdom and kindness. One day, a young traveler arrived at her door, exhausted and beaten by a recent storm. The old woman welcomed him in, tended to his wounds, and gave him a warm meal. As the traveler rested, he shared stories of the world beyond their village. The old woman listened, her eyes sparkling with curiosity and longing. When the traveler left, she stood at her door, watching him go with a mix of sadness and excitement. That night, she wrote in her journal, "One must always be ready to step beyond the known, for it is only in the unknown that we find ourselves."
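When the streamed text is needed as a single string rather than printed piece by piece, the chunks can be collected first. The sketch below assumes each chunk exposes `choices[0].delta.content`, as in the streaming test above; `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks only imitate that shape so the example runs without a model.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Concatenate the text of all chunks, skipping empty or missing content."""
    parts = []
    for chunk in chunks:
        if not chunk or not chunk.choices:
            continue
        content = chunk.choices[0].delta.content
        if content:
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object with the same attribute path as a real chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


story = collect_stream([fake_chunk("Once "), fake_chunk(None), fake_chunk("upon a time")])
print(story)  # Once upon a time
```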

### b) Test Generation

```python
response = host.generator.create(
    messages=[{"role":"user","content":"Tell me a one paragraph story"},
                {"role":"assistant","content":""}],
    stream=False)
print(response.choices[0].message.content)
```


Once upon a time, in a small village surrounded by lush greenery, there lived a kind-hearted old man named Samuel. Despite his age, Samuel was known for his strength and resilience, often helping his neighbors with their heavy chores. One day, a powerful storm struck the village, uprooting trees and causing havoc. Samuel, with his unwavering spirit, led the villagers in clearing the destruction, demonstrating that true power lies not in physical strength, but in the heart.


### c) Test Tool Calling

```python
messages = [
    {"role":"user","content":"What is the current time in Singapore?"},
    {"role":"assistant","content":""}
]
tool_choice="required"
tools = [
    {
        "type": "function",
        "function": {
            "name": "google",
            "description": "The 'google' function is a powerful tool that allows the AI to gather external information from the internet using Google search. It can be invoked when the AI needs to answer a question or provide information that requires up-to-date, comprehensive, and diverse sources which are not inherently known by the AI. For instance, it can be used to find current date, current news, weather updates, latest sports scores, trending topics, specific facts, or even the current date and time. The usage of this tool should be considered when the user's query implies or explicitly requests recent or wide-ranging data, or when the AI's inherent knowledge base may not have the required or most current information. The 'search_query' parameter should be a concise and accurate representation of the information needed.",
            "parameters": {
                "type": "object",
                "properties": {
                    "search_query": {
                        "type": "string",
                        "description": "The search query to search google with. For example, to find the current date or time, use 'current date' or 'current time' respectively."
                    }
                },
                "required": ["search_query"]
            }
        }
    }
]
response = host.generator.create(
    messages=messages,
    tools=tools,
    tool_choice=tool_choice,
    stream=False)
print(response)
```


Output (truncated grammar dump):

```text
additional-kv ::= string [:] space additional-value 
string ::= ["] string_103 ["] space 
space ::= space_102 
additional-value ::= object 
additional-kvs ::= additional-kv additional-kvs_6 
additional-kvs_5 ::= [,] space additional-kv 
additional-kvs_6 ::= additional-kvs_5 additional-kvs_6 | 
object ::= [{] space object_97 [}] space 
array ::= [[] space array_13 []] space 
array_9 ::= value array_12 
value ::= object | array | string | number | boolean | null 
array_11 ::= [,] space value 
array_12 ::= array_11 array_12 | 
array_13 ::= array_9 | 
boolean ::= boolean_15 space 
boolean_15 ::= [t] [r] [u] [e] | [f] [a] [l] [s] [e] 
char ::= [^"\] | [\] char_17 
char_17 ::= ["\/bfnrt] | [u] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] 
decimal-part ::= [0-9] decimal-part_48 
decimal-part_19 ::= [0-9] decimal-part_47 
decimal-part_20 ::= [0-9] decimal-part_46 
decimal-part_21 ::= [0-9] decimal-part_45 
decimal-part_22 ::= [0-9] decimal-part_44 
decimal-part_23 ::= [0-9] decimal-part_43 
```

### d) Test Structured Output

```python
# Define the schema
from pydantic import BaseModel
class Book(BaseModel):
    title: str
    summary: str
    author: str
    published_year: int

text = """Foundation is a science fiction novel by American writer
Isaac Asimov. It is the first published in his Foundation Trilogy (later
expanded into the Foundation series). Foundation is a cycle of five
interrelated short stories, first published as a single book by Gnome Press
in 1951. Collectively they tell the early story of the Foundation,
an institute founded by psychohistorian Hari Seldon to preserve the best
of galactic civilization after the collapse of the Galactic Empire.
"""
# model_json_schema() replaces the schema() method deprecated in Pydantic V2
response = host.generator.create(messages=[{'role':'user','content':text},{'role':'assistant','content':''}], 
    json_schema=Book.model_json_schema(),
    stream=False
    )
print(response)
```


Output (truncated grammar dump):

```text
author-kv ::= ["] [a] [u] [t] [h] [o] [r] ["] space [:] space string 
space ::= space_43 
string ::= ["] string_44 ["] space 
char ::= [^"\] | [\] char_4 
char_4 ::= ["\/bfnrt] | [u] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] 
integer ::= integer_6 space 
integer_6 ::= integer_7 integral-part 
integer_7 ::= [-] | 
integral-part ::= [0-9] | [1-9] integral-part_38 
integral-part_9 ::= [0-9] integral-part_37 
integral-part_10 ::= [0-9] integral-part_36 
integral-part_11 ::= [0-9] integral-part_35 
integral-part_12 ::= [0-9] integral-part_34 
integral-part_13 ::= [0-9] integral-part_33 
integral-part_14 ::= [0-9] integral-part_32 
integral-part_15 ::= [0-9] integral-part_31 
integral-part_16 ::= [0-9] integral-part_30 
integral-part_17 ::= [0-9] integral-part_29 
integral-part_18 ::= [0-9] integral-part_28 
integral-part_19 ::= [0-9] integral-part_27 
integral-part_20 ::= [0-9] integral-part_26 
integral-part_21 ::= [0-9] integral-part_25 
integral-part_22 ::= [0-9] integral-part_24 
```


---
## 4. API Test

**Instructions**:

a) Press `F5` to start the API server.

b) Wait for the server to start.
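Rather than guessing when the server is ready, the wait in step b) can be automated by polling the root URL until it answers. This is a minimal sketch using only the standard library; `wait_for_server` is a hypothetical helper, and it assumes the root endpoint at `http://localhost:12031` returns HTTP 200 once the service is up.

```python
import http.client
import time
import urllib.request


def wait_for_server(url="http://localhost:12031", timeout=60.0, interval=1.0) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused, timeout, empty reply or HTTP error: retry
            pass
        time.sleep(interval)
    return False
```

Calling `wait_for_server()` before the tests below returns `True` as soon as the server responds and `False` if it does not respond within the timeout.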

**Tests**:

Run the following cells to test the API.

### a) Test Generating

```bash
curl -X POST \
    http://localhost:12031/gen/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -s \
    -N \
    -d "{\"model\":\"llamacpp-dolphin\", \
        \"messages\": [ \
            {\"role\": \"user\",\"content\": \"Tell me a story.\"}, \
            {\"role\": \"assistant\",\"content\": \"\"} \
        ],\
        \"tool_choice\": \"none\"}"
```

CalledProcessError: Command 'b'curl -X POST \\\n    http://localhost:12031/gen/v1/chat/completions \\\n    -H \'Content-Type: application/json\' \\\n    -s \\\n    -N \\\n    -d "{\\"model\\":\\"llamacpp-dolphin\\", \\\n        \\"messages\\": [ \\\n            {\\"role\\": \\"user\\",\\"content\\": \\"Tell me a story.\\"}, \\\n            {\\"role\\": \\"assistant\\",\\"content\\": \\"\\"} \\\n        ],\\\n        \\"tool_choice\\": \\"none\\"}"\n        \n'' returned non-zero exit status 52.

### b) Test Streaming

```python
import json
import httpx

json_payload = {
    "temperature": 0.2,
    "max_tokens": 50,
    "stream": True,  # JSON boolean rather than the string "true"
    "messages": [
        {
            "role": "user",
            "content": "Tell me a one paragraph story."
        },
        {
            "role": "assistant",
            "content": ""
        }
    ]
}

async def http_post_async(json_payload):
    # Send the POST request with httpx and read the response as a stream
    async with httpx.AsyncClient(timeout=30.0) as client:
        async with client.stream("POST", "http://localhost:12031/gen/v1/chat/completions", json=json_payload) as response:
            response.raise_for_status()
            # aiter_text() handles decoding; this assumes each chunk holds exactly one JSON object
            async for chunk in response.aiter_text():
                data = json.loads(chunk)
                content = data["choices"][0]["delta"]["content"]
                if content:  # Skip empty or missing content
                    print(content, end="", flush=True)

# Top-level await works inside a notebook cell
await http_post_async(json_payload)
```


Once upon a time, in a small village nestled between two hills, there lived a young girl named Lily. She had a heart full of dreams and a spirit that refused to be tamed. Despite the limitations of her village, she yearned

```python
import json
import httpx

# Build the JSON payload
json_payload = {
    "temperature": 0.2,
    "max_tokens": 1000,
    "stream": True,
    "messages": [
        {
            "role": "user",
            "content": "Tell me a one paragraph story."
        },
        {
            "role": "assistant",
            "content": ""
        }
    ]
}

# Send the POST request with httpx and read the response as a stream
with httpx.Client(timeout=30.0) as client:
    with client.stream("POST", "http://localhost:12031/gen/v1/chat/completions", json=json_payload) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if not line.strip():
                continue  # Skip blank lines between chunks
            result = json.loads(line)
            content = result["choices"][0]["delta"]["content"]
            if content:
                print(content, end="", flush=True)
```
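The line-by-line parsing can be separated from the HTTP call so that it is easy to test without a running server. The sketch below assumes the stream is one JSON object per line with the text under `choices[0].delta.content`, which is what the client code above expects; `iter_contents` is a hypothetical helper and not part of this repository.

```python
import json


def iter_contents(lines):
    """Yield the non-empty text pieces from an iterable of response lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            chunk = json.loads(line)
        except json.JSONDecodeError:
            continue  # Skip anything that is not a complete JSON object
        if not isinstance(chunk, dict):
            continue
        choices = chunk.get("choices") or []
        if not choices:
            continue
        content = (choices[0].get("delta") or {}).get("content")
        if content:
            yield content


sample_lines = [
    '{"choices":[{"delta":{"content":"Once"}}]}',
    "",
    '{"choices":[{"delta":{"content":null}}]}',
    '{"choices":[{"delta":{"content":" upon a time"}}]}',
]
print("".join(iter_contents(sample_lines)))  # Once upon a time
```

With httpx, the same generator can be fed directly from `response.iter_lines()`.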


### c) Test Tool Calling

```bash
curl -X POST \
    http://localhost:12031/gen/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -s \
    -N \
    -d "{\"model\":\"llamacpp-mistral7b\", \
        \"messages\": [ \
            {\"role\": \"user\",\"content\": \"What is the current time in Singapore\"}, \
            {\"role\": \"assistant\",\"content\": \"\"} \
        ],\
        \"tools\": [\
            {\
                \"type\": \"function\",\
                \"function\": {\
                    \"name\": \"google\",\
                    \"description\": \"The 'google' function is a powerful tool that allows the AI to gather external information from the internet using Google search. It can be invoked when the AI needs to answer a question or provide information that requires up-to-date, comprehensive, and diverse sources which are not inherently known by the AI. For instance, it can be used to find current date, current news, weather updates, latest sports scores, trending topics, specific facts, or even the current date and time. The usage of this tool should be considered when the user's query implies or explicitly requests recent or wide-ranging data, or when the AI's inherent knowledge base may not have the required or most current information. The 'search_query' parameter should be a concise and accurate representation of the information needed.\",\
                    \"parameters\": {\
                        \"type\": \"object\",\
                        \"properties\": {\
                            \"search_query\": {\
                                \"type\": \"string\",\
                                \"description\": \"The search query to search google with. For example, to find the current date or time, use 'current date' or 'current time' respectively.\"\
                            }\
                        },\
                        \"required\": [\"search_query\"]\
                    }\
                }\
            }\
        ],\
        \"tool_choice\": \"required\"}"
```

{"id":"chatcmpl-cc9d6318-d239-4686-926d-6e7049346b0f","choices":[{"finish_reason":"tool_calls","index":0,"logprobs":null,"message":{"content":null,"refusal":null,"role":"assistant","function_call":null,"tool_calls":[{"id":"call_191e7a67-4fd1-4d9d-a136-f8c153ae8a4c","function":{"arguments":"{\"location\": \"Singapore\"}","name":"ask_time"},"type":"function"}]}}],"created":1725122276,"model":"llamacpp-mistral7b","object":"chat.completion","service_tier":null,"system_fingerprint":null,"usage":{"completion_tokens":26,"prompt_tokens":15,"total_tokens":41}}
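A client usually needs the function name and the decoded arguments from such a response. The following is a minimal sketch that works on the parsed JSON body; `extract_tool_calls` is a hypothetical helper, and the sample dictionary is a shortened copy of the response above with a placeholder call id.

```python
import json


def extract_tool_calls(response: dict) -> list:
    """Return (function name, decoded arguments) pairs from a chat completion."""
    calls = []
    for choice in response.get("choices", []):
        message = choice.get("message") or {}
        for call in message.get("tool_calls") or []:
            fn = call["function"]
            # The arguments field is itself a JSON-encoded string
            calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


sample = {
    "choices": [{
        "finish_reason": "tool_calls",
        "message": {
            "content": None,
            "tool_calls": [{
                "id": "call_example",  # placeholder id
                "function": {"arguments": "{\"location\": \"Singapore\"}", "name": "ask_time"},
                "type": "function",
            }],
        },
    }]
}
print(extract_tool_calls(sample))  # [('ask_time', {'location': 'Singapore'})]
```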

### d) Test JSON Schema

```bash
curl -X POST \
    http://localhost:12031/gen/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -s \
    -N \
    -d "{\"model\":\"llamacpp-mistral7b\", \
        \"messages\": [ \
            {\"role\": \"user\",\"content\": \"Foundation is a science fiction novel by American writer \
            Isaac Asimov. It is the first published in his Foundation Trilogy (later \
            expanded into the Foundation series). Foundation is a cycle of five \
            interrelated short stories, first published as a single book by Gnome Press \
            in 1951. Collectively they tell the early story of the Foundation, \
            an institute founded by psychohistorian Hari Seldon to preserve the best \
            of galactic civilization after the collapse of the Galactic Empire.\"}, \
            {\"role\": \"assistant\",\"content\": \"\"} \
        ],\
        \"json_schema\": {\"properties\": \
            {\"title\": \
                {\"title\": \"Title\", \"type\": \"string\"}, \
                    \"summary\": {\"title\": \"Summary\", \"type\": \"string\"}, \
                    \"author\": {\"title\": \"Author\", \
                    \"type\": \"string\"\
                }, \
                \"published_year\": {\
                    \"title\": \"Published Year\", \
                    \"type\": \"integer\"}}, \
                \"required\": [\
                    \"title\", \
                    \"summary\", \
                    \"author\", \
                    \"published_year\"\
                ], \
                \"title\": \"Book\", \
                \"type\": \"object\"\
            },\
        \"tool_choice\": \"none\"}"
```

{"id":"chatcmpl-5184ce5e-8f31-4f6e-aee5-6f2ebd71341d","choices":[{"finish_reason":"stop","index":0,"logprobs":null,"message":{"content":"{ \"title\": \"Foundation\", \"summary\": \"Foundation is a science fiction novel by American writer Isaac Asimov. It is the first published in his Foundation Trilogy (later expanded into the Foundation series). Foundation is a cycle of five interrelated short stories, first published as a single book by Gnome Press in 1951. Collectively they tell the early story of the Foundation, an institute founded by psychohistorian Hari Seldon to preserve the best of galactic civilization after the collapse of the Galactic Empire.\", \"author\": \"Isaac Asimov\", \"published_year\": 1951}","refusal":null,"role":"assistant","function_call":null,"tool_calls":null}}],"created":1725122322,"model":"llamacpp-mistral7b","object":"chat.completion","service_tier":null,"system_fingerprint":null,"usage":{"completion_tokens":146,"prompt_tokens":120,"total_tokens":266}}
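The structured output arrives as a JSON string in `message.content`, so the client still has to decode and check it. Below is a minimal standard-library sketch; `parse_book` and `BOOK_FIELDS` are hypothetical names that mirror the four fields of the `Book` schema, and the sample content is shortened from the response above.

```python
import json

# Field names and types mirroring the Book schema used in the request
BOOK_FIELDS = {"title": str, "summary": str, "author": str, "published_year": int}


def parse_book(content: str) -> dict:
    """Decode `content` and check that every Book field has the expected type."""
    book = json.loads(content)
    for field, expected in BOOK_FIELDS.items():
        if field not in book:
            raise ValueError(f"missing field: {field}")
        if type(book[field]) is not expected:
            raise ValueError(f"{field} should be {expected.__name__}")
    return book


content = '{"title": "Foundation", "summary": "A cycle of five interrelated short stories.", "author": "Isaac Asimov", "published_year": 1951}'
book = parse_book(content)
print(book["author"], book["published_year"])  # Isaac Asimov 1951
```

Where Pydantic V2 is available, `Book.model_validate_json(content)` performs the same decoding and validation against the model itself.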

### e) Shut down the API Service

---

## 5. Docker

**Instructions:** 

- Press **CTRL+SHIFT+P** > **Tasks: Run Task** > **docker-compose: up**

### Smoke Test

```bash
curl http://localhost:12031
```

{"message":"gai-ttt-svr-llamacpp"}

**Tests:**

Repeat the API tests from section 4.

**Tear Down:**

- Press **CTRL+SHIFT+P** > **Tasks: Run Task** > **docker-compose: down**

### Debugging

a) Start the container with `python -m debugpy --listen 0.0.0.0:5678 main.py`.

b) Make sure port 5678 is open.

c) Click "Debug" in the toolbar.

d) Select "Attach" > "Run and Debug".

e) Add a breakpoint in the code.

f) Run the API test to see if it triggers the breakpoint.
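If attaching fails, a first thing to rule out is that the debug port is not reachable from the host. The sketch below is a plain TCP check with the standard library; `port_is_open` is a hypothetical helper, and the defaults assume the container publishes port 5678 on `localhost` as described in step b).

```python
import socket


def port_is_open(host="localhost", port=5678, timeout=1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, unreachable host or timeout
        return False
```

`port_is_open()` returning `False` suggests that the container was started without debugpy or that the port is not published.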