# Gai/Gen: Text-to-Text (TTT)

## 1.1 Setting Up

We will create a separate virtual environment for this package to avoid dependency conflicts between the underlying models.

```sh
sudo apt update -y && sudo apt install ffmpeg git git-lfs -y
conda create -n TTT python=3.10.10 -y
conda activate TTT
pip install -e ".[TTT]"
```

The following examples have been tested in this environment:

-   NVIDIA GeForce RTX 2060 6GB
-   Windows 11 + WSL2
-   Ubuntu 22.04
-   Python 3.10
-   CUDA Toolkit 11.8
-   openai 1.6.1
-   anthropic 0.8.1
-   transformers 4.36.2
-   bitsandbytes 0.41.3.post2
-   scipy 1.11.4
-   accelerate 0.25.0
-   llama-cpp-python 0.2.25


## 1.2 Running as a Library

### OpenAI GPT4

For (1) and (2) below, you will use the GaiGen library to call OpenAI's GPT-4.
You will need an API key from OpenAI.
Create a `.env` file in the project root directory and add the OpenAI API key as shown below:

```sh
OPENAI_API_KEY=<your key here>
```
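
Before running the examples, it can be useful to confirm that the key in `.env` is readable. The sketch below parses simple `KEY=VALUE` lines with the standard library only. It is not how GaiGen itself loads the file (that is not shown here), and the helper names `parse_env` and `load_env` are hypothetical.

```python
import os


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env(path=".env"):
    """Copy variables from the file into os.environ without overriding existing ones."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as handle:
        values = parse_env(handle.read())
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values


load_env()
print("OPENAI_API_KEY is set:", bool(os.environ.get("OPENAI_API_KEY")))
```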

In [None]:
### 1. GPT4 Text-to-Text Generation

print("GENERATING:")
from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('gpt-4')
# Non-streaming call: the full completion is returned in a single response object
response = gen.create(messages=[{'role':'USER','content':'Tell me a one paragraph short story.'},{'role':'ASSISTANT','content':''}], max_tokens=100,stream=False)
print(response.choices[0].message.content)


In [None]:
### 2. GPT4 Text-to-Text Streaming

print("STREAMING:")
from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('gpt-4')
response = gen.create(messages=[{'role':'USER','content':'Tell me a one paragraph short story.'},{'role':'ASSISTANT','content':''}],stream=True)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content,end='',flush=True)
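
If you need the full text after streaming (for logging or post-processing, say), the chunks can be accumulated as they arrive. The sketch below uses stand-in chunk objects so that it runs without an API key. It assumes only the `chunk.choices[0].delta.content` attribute path used in the streaming example; `collect_stream` and `fake_chunk` are illustrative names, not part of GaiGen.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the delta.content of each chunk, skipping chunks that carry no text."""
    parts = []
    for chunk in chunks:
        content = chunk.choices[0].delta.content
        if content:
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in chunk with the same attribute path as a streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


# A final or role-only chunk may carry no content, hence the None entry.
stream = [fake_chunk("Once "), fake_chunk(None), fake_chunk("upon a time.")]
print(collect_stream(stream))
```

With a real response, `collect_stream(response)` would replace the `for` loop, at the cost of not printing tokens as they arrive.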

### Mistral 7B 8k-context 4-bit quantized

For (3) and (4), you will run Mistral 7B locally. Download TheBloke's 4-bit quantized version of the Mistral-7B model from Hugging Face. This model uses the ExLlama loader for better performance. Make sure `huggingface-hub` is installed; if it is not, run `pip install huggingface-hub`.

In [None]:
%%bash
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.1-GPTQ \
        --local-dir ~/gai/models/Mistral-7B-Instruct-v0.1-GPTQ \
        --local-dir-use-symlinks False

In [3]:
### 3. Mistral Text-to-Text Generation

print("GENERATING:")
from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('mistral7b-exllama')
response = gen.create(messages=[{'role':'USER','content':'Tell me a one paragraph short story.'},{'role':'ASSISTANT','content':''}],max_new_tokens=100, stream=False)
print(response.choices[0].message.content)


GENERATING:
 Once upon a time, in a small village nestled at the foot of a mountain, there lived an old woman who was known for her wisdom and kindness. She had spent her entire life studying the mysteries of nature and the secrets of the universe, and she believed that everything happened for a reason. One day, as she sat on her porch watching the sun set over the mountains, she noticed a young boy playing in the field across the street


In [2]:
### 4. Mistral Text-to-Text Streaming

print("STREAMING:")
from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('mistral7b-exllama')
response = gen.create(messages=[{'role':'USER','content':'Tell me a one paragraph short story.'},{'role':'ASSISTANT','content':''}],max_new_tokens=100,stream=True)
for chunk in response:
    if (chunk.choices[0].delta.content):
        print(chunk.choices[0].delta.content,end='',flush=True)


STREAMING:
 Once upon a time, in a small village nestled at the foot of a mountain, there lived an old woman who was known for her wisdom and kindness. She had spent her entire life studying the mysteries of nature and the secrets of the universe, and she had gained a deep understanding of the interconnectedness of all things. One day, as she sat by the riverbank, watching the sun set over the mountains, she felt a sense

### Yarn-Mistral-7B 128k-context 4-bit quantized

Repeat the earlier examples using a different version of the Mistral-7B model with a larger context window.

In [None]:
%%bash
huggingface-cli download TheBloke/Yarn-Mistral-7B-128k-GPTQ \
        --local-dir ~/gai/models/Yarn-Mistral-7B-128k-GPTQ \
        --local-dir-use-symlinks False

According to the YaRN paper, perplexity appears to be lower (better) than the original model's once the token length exceeds 10k.

![perplexity-of-mistral7b-128k](https://raw.githubusercontent.com/jquesnelle/yarn/mistral/data/proofpile-long-small-mistral.csv.png)
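
For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-probability a model assigns to each token, so lower values mean the model is less "surprised" by the text. The sketch below computes it from a list of per-token natural-log probabilities. The numbers are made up for illustration and are not taken from the paper.

```python
import math


def perplexity(token_logprobs):
    """Return exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A model that assigns probability 0.5 to every token vs. one that assigns 0.1.
confident = [math.log(0.5)] * 4
uncertain = [math.log(0.1)] * 4
print(perplexity(confident), perplexity(uncertain))
```

The first value comes out at about 2 and the second at about 10: a perplexity of N roughly corresponds to choosing uniformly among N tokens at each step.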



In [None]:
### 3. Mistral Text-to-Text Generation

print("GENERATING:")
from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('mistral7b_128k-exllama')
response = gen.create(messages=[{'role':'USER','content':'Tell me a one paragraph short story.'},{'role':'ASSISTANT','content':''}],max_new_tokens=100)
print(response.choices[0].message.content,end='',flush=True)

In [None]:
### 4. Mistral Text-to-Text Streaming

print("STREAMING:")
from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('mistral7b_128k-exllama')
response = gen.create(messages=[{'role':'USER','content':'Tell me a one paragraph short story.'},{'role':'ASSISTANT','content':''}],max_new_tokens=100,stream=True)
for chunk in response:
    if (chunk.choices[0].delta.content):
        print(chunk.choices[0].delta.content,end='',flush=True)

### Anthropic's Claude 2.1

The following example uses Anthropic's Claude 2.1 model with a 100k context window. Get an API key from Anthropic and add it to the .env file.
```sh
ANTHROPIC_APIKEY=<your key here>
```

In [None]:
### 5. Claude-2.1 Text-to-Text Generation

print("GENERATING:")
from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('claude2-100k')
response = gen.create(messages=[{'role':'USER','content':'Tell me a one paragraph short story.'},{'role':'ASSISTANT','content':''}],max_tokens_to_sample=100)
print(response.choices[0].message.content,end='',flush=True)

In [None]:
### 6. Claude-2.1 Text-to-Text Streaming

print("STREAMING:")
from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('claude2-100k')
response = gen.create(messages=[{'role':'USER','content':'Tell me a one paragraph short story.'},{'role':'ASSISTANT','content':''}],max_tokens_to_sample=100,stream=True)
for chunk in response:
    print(chunk.choices[0].delta.content,end='',flush=True)

### Llama2 7B with HuggingFace transformers

Follow the instructions [here](https://huggingface.co/docs/transformers/main/en/model_doc/llama2) to sign up with Meta and get access to the Llama 2 model.
Download the model in Hugging Face format from [here](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) into ~/gai/models/Llama-2-7b-chat-hf.

In [None]:
### 7. Llama2-7B Text-to-Text Generation

print("GENERATING:")
from gai.gen import Gaigen
from IPython.utils import io
with io.capture_output() as captured:
    gen = Gaigen.GetInstance().load('llama2-transformers')
    response = gen.create(messages=[{'role':'USER','content':'Tell me a one paragraph short story.'},{'role':'ASSISTANT','content':''}],max_new_tokens=100)
print(response.choices[0].message.content,end='',flush=True)

In [None]:
### 8. Llama2-7B Text-to-Text Streaming

print("STREAMING:")
from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('llama2-transformers')
from IPython.utils import io
with io.capture_output() as captured:
    response = gen.create(messages=[{'role':'USER','content':'Tell me a one paragraph short story.'},{'role':'ASSISTANT','content':''}],max_new_tokens=100,stream=True)
for chunk in response:
    print(chunk.choices[0].delta.content,end='',flush=True)

### Mistral 7B GGUF with LlamaCPP (CPU only)

The following example uses a GGUF-formatted version of Mistral-7B for LlamaCPP. Use it when you want the model to run on the CPU only.

In [None]:
%%bash
# Download the model
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.1-GGUF \
                mistral-7b-instruct-v0.1.Q4_K_M.gguf  \
                config.json \
                --local-dir ~/gai/models/Mistral-7B-Instruct-v0.1-GGUF \
                --local-dir-use-symlinks False

In [None]:
### 9. Mistral-7B CPU-Only Text-to-Text Generation

print("GENERATING:")
from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('mistral7b-llamacpp')
from IPython.utils import io
import sys
with io.capture_output() as captured:
    # Redirect stderr to stdout
    sys.stderr = sys.stdout    
    response = gen.create(messages=[{'role':'USER','content':'Tell me a one paragraph short story.'},{'role':'ASSISTANT','content':''}],max_new_tokens=100)
print(response.choices[0].message.content,end='',flush=True)

In [None]:
### 10. Mistral-7B CPU-Only Text-to-Text Streaming

print("STREAMING:")
from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('mistral7b-llamacpp')
from IPython.utils import io
with io.capture_output() as captured:
    response = gen.create(messages=[{'role':'USER','content':'Tell me a one paragraph short story.'},{'role':'ASSISTANT','content':''}],max_new_tokens=100,stream=True)
for chunk in response:
    print(chunk.choices[0].delta.content,end='',flush=True)

## 1.3 Using Function Call

OpenAI provides a powerful API feature called function calling. It lets the LLM seek external help when it reaches the limits of what it can generate on its own: instead of answering directly, it returns a string that emulates a function call, based on the function descriptions provided by the user.

We extend this feature to the open-source models.
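
To make the idea concrete, the sketch below shows one way raw model text could be turned into an OpenAI-style tool call. This is not GaiGen's implementation. The JSON shape emitted by the model (`{"type": "tool", "tool": {"name": ..., "arguments": ...}}`) and the helper name `to_tool_call` are assumptions chosen for illustration.

```python
import json
import uuid


def to_tool_call(raw_text, allowed_names):
    """Convert model text into an OpenAI-style tool call dict, or None if it is not one."""
    # Tolerate chatter around the JSON by slicing from the first '{' to the last '}'.
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        payload = json.loads(raw_text[start:end + 1])
    except json.JSONDecodeError:
        return None
    tool = payload.get("tool")
    if payload.get("type") != "tool" or not isinstance(tool, dict):
        return None
    if tool.get("name") not in allowed_names:
        return None  # reject tools the model invented
    return {
        "id": f"call_{uuid.uuid4()}",
        "type": "function",
        "function": {
            "name": tool["name"],
            # OpenAI-style tool calls carry the arguments as a JSON string.
            "arguments": json.dumps(tool.get("arguments", {})),
        },
    }


# Hypothetical model output, including some leading chatter.
raw = 'Sure. {"type": "tool", "tool": {"name": "gg", "arguments": {"search_query": "current date"}}}'
print(to_tool_call(raw, {"gg", "scrape"}))
```

Checking the name against an allow-list mirrors the "do not invent your own tools" constraint that the system prompt places on the model.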

In [13]:
from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('mistral7b-exllama')

import json
system_prompt = """
        You will always begin your interaction by asking yourself if the user's message is a message that requires a tool response or a text response.
                        
        DEFINITIONS:
        1. A tool response is based on the following JSON format:
                <tool>
                {{
                    'type':'tool',
                    'tool': ...
                }}
                </tool>
        
           And the tool is chosen from the following <tools> list:
                <tools>
                {tools}
                </tools>.
            
        2. A text response is based on the following JSON format:
                <text>
                {{
                    'type':'text',
                    'text': ...
                }}
                </text>
        
        STEPS:
        1. Think about the nature of the user's message.
            * Is the user's message a question that I can answer factually within my knowledge domain?
            * Are there any dependencies to external factors that I need to consider before answering the user's question?
            * What are the tools I have at my disposal to help me answer the user's question? 
        2. If the user's message requires a tool response, pick the most suitable tool response from <tools>. 
            * I can refer to the "description" field of each tool to help me decide.
            * For example, if I need to search for real-time information, I can use the "gg" tool and if I know where to find the information, I can use the "scrape" tool.
        3. If the user's message does not require a tool response, provide a text response to the user.

        CONSTRAINTS:        
        1. You can only provide a tool response or a text response and nothing else.
        2. When providing a tool response, respond only in JSON and only pick from <tools>. That means, begin your message with a curly bracket '{{' and end your message with a curly bracket '}}'. Do not respond with anything else.
        3. Remember, do not invent your own tools. You can only pick from <tools>.
"""

tools = [
    {
        "type": "function",
        "function": {
            "name": "gg",
            "description": "The 'gg' function is a powerful tool that allows the AI to gather external information from the internet using Google search. It can be invoked when the AI needs to answer a question or provide information that requires up-to-date, comprehensive, and diverse sources which are not inherently known by the AI. For instance, it can be used to find current news, weather updates, latest sports scores, trending topics, specific facts, or even the current date and time. The usage of this tool should be considered when the user's query implies or explicitly requests recent or wide-ranging data, or when the AI's inherent knowledge base may not have the required or most current information. The 'search_query' parameter should be a concise and accurate representation of the information needed.",
            "parameters": {
                "type": "object",
                "properties": {
                    "search_query": {
                        "type": "string",
                        "description": "The search query to search google with. For example, to find the current date or time, use 'current date' or 'current time' respectively."
                    }
                },
                "required": ["search_query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "scrape",
            "description": "Scrape the content of the provided url",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The url to scrape the content from"
                    }
                },
                "required": ["url"]
            }
        }
    }
]


# user_prompt = "Where did PM Lee Hsien Loong hold his 2023 national day rally?"
# user_prompt = "Who is the current president of singapore?"
# user_prompt = "Tell me the latest news on Singapore"
# user_prompt = "Tell me a one paragraph short story."
user_prompt = "What is today's date?"

response = gen.create(messages=[{'role':'system','content':system_prompt.format(tools=json.dumps(tools))},
                                {'role':'user','content':user_prompt},
                                {'role':'assistant','content':''}],
                    stream=False,
                    max_new_tokens=100)
print(response.choices[0].message)


ChatCompletionMessage(content=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_1109f336-a568-4594-a944-6dd2208382cd', function=Function(arguments=' {\n            "search_query": "current date"\n        }', name='gg'), type='function')])
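
Once a tool call like the one above comes back, the caller still has to execute it. A minimal sketch of that dispatch step follows. It assumes only the `function.name` and `function.arguments` fields visible in the output; the `gg` and `scrape` bodies are placeholders and `REGISTRY` and `run_tool_call` are hypothetical names.

```python
import json
from types import SimpleNamespace


def gg(search_query):
    """Placeholder search tool; a real one would call a search API."""
    return f"(search results for {search_query!r})"


def scrape(url):
    """Placeholder scraper; a real one would fetch and clean the page."""
    return f"(content of {url})"


# Map tool names from the `tools` list to local Python callables.
REGISTRY = {"gg": gg, "scrape": scrape}


def run_tool_call(tool_call):
    """Look up the named tool and call it with the JSON-decoded arguments."""
    function = tool_call.function
    handler = REGISTRY.get(function.name)
    if handler is None:
        raise ValueError(f"unknown tool: {function.name}")
    return handler(**json.loads(function.arguments))


# Stand-in object with the same fields as the tool call printed above.
example = SimpleNamespace(
    function=SimpleNamespace(
        name="gg",
        arguments=' {\n            "search_query": "current date"\n        }',
    )
)
print(run_tool_call(example))
```

With a real response, `run_tool_call(response.choices[0].message.tool_calls[0])` would take the place of the stand-in object.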


## 1.4 Running as a Service

### Step 1: Start Docker container

In [None]:
%%bash

# Stop any container with the same name
docker rm -f gai-ttt

# Start the container
docker run -d \
    --name gai-ttt \
    -p 12031:12031 \
    --gpus all \
    -v ~/gai/models:/app/models \
    kakkoii1337/gai-ttt:latest

# Wait for model to load
sleep 30

# Confirm it's running
docker logs gai-ttt

When loading is complete, the logs should show this:

```bash
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:12031 (Press CTRL+C to quit)
```
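
The fixed `sleep 30` in the cell above may be too short on slower machines and needlessly long on faster ones. As an alternative, the sketch below polls the service until it returns any HTTP response. It uses only the standard library; the function name `wait_for_http` and the choice to treat any HTTP status (even an error) as "up" are assumptions, since no dedicated health endpoint is documented here.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url, timeout=120.0, interval=2.0):
    """Poll `url` until the server returns any HTTP response or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            urllib.request.urlopen(url, timeout=interval).close()
            return True
        except urllib.error.HTTPError:
            # The server answered with an error status, so it is running.
            return True
        except (urllib.error.URLError, OSError):
            # Not accepting requests yet: wait and try again.
            time.sleep(interval)
    return False
```

For the container above, `wait_for_http("http://localhost:12031/")` would be called before sending the first request. An HTTP-level check is used rather than a bare TCP connect because Docker's port forwarding can accept connections before the application inside the container is listening.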

### Step 2: Run Text Generation Client

The default model is Mistral-7B with an 8k context size.

In [None]:
import requests,json
response = requests.post(
    url='http://localhost:12031/gen/v1/chat/completions', 
    json={
        "model": "mistral7b-exllama",
        "messages": [
            {"role": "user", "content": "Tell me a one paragraph short story."},
            {"role": "assistant", "content": ""}
        ],
        "max_new_tokens": 100,
        "stream": True
    },
    stream=True)
for chunk in response.iter_lines():
    if not chunk:
        continue  # skip keep-alive blank lines
    result = json.loads(chunk.decode('utf-8'))
    content = result["choices"][0]["delta"].get("content")
    if content:
        print(content,end='',flush=True)


In [None]:
import json
system_prompt = """
        You will always begin your interaction by asking yourself if the user's message is a message that requires a tool response or a text response.
                        
        DEFINITIONS:
        1. A tool response is based on a tool chosen from the <tools> list below:
                <tools>
                {tools}
                </tools>.
            
        2. A text response is a normal text response that you would provide to the user, formatted as JSON:
                <text>
                    {{
                        "type": "text",
                        "text": ...
                    }}
                </text>
        
        STEPS:
        1. Think about the nature of the user's message.
            * Is the user's message a question that I can answer factually within my knowledge domain?
            * Are there any dependencies to external factors that I need to consider before answering the user's question?
            * What are the tools I have at my disposal to help me answer the user's question? 
        2. If the user's message requires a tool response, pick the most suitable tool response from <tools>. 
            * I can refer to the "description" field of each tool to help me decide.
            * For example, if I need to search for real-time information, I can use the "gg" tool and if I know where to find the information, I can use the "scrape" tool.
        3. If the user's message does not require a tool response, provide a text response to the user.

        CONSTRAINTS:        
        1. You can only provide a tool response or a text response and nothing else.
        2. When providing a tool response, respond only in JSON and only pick from <tools>. That means, begin your message with a curly bracket '{{' and end your message with a curly bracket '}}'. Do not respond with anything else.
        3. Remember, do not invent your own tools. You can only pick from <tools>.
"""

tools = [
    {
        "type": "function",
        "function": {
            "name": "gg",
            "description": "Search google based on the provided search query",
            "parameters": {
                "type": "object",
                "properties": {
                    "search_query": {
                        "type": "string",
                        "description": "The search query to search google with"
                    }
                },
                "required": ["search_query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "scrape",
            "description": "Scrape the content of the provided url",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The url to scrape the content from"
                    }
                },
                "required": ["url"]
            }
        }
    }
]


user_prompt = "Where did PM Lee Hsien Loong hold his 2023 national day rally?"
#user_prompt = "Tell me the latest news on Singapore"

import requests,json
response = requests.post(
    url='http://localhost:12031/gen/v1/chat/completions', 
    json={
        "model": "mistral7b-exllama",
        "messages": [
            {"role": "system", "content": system_prompt.format(tools=json.dumps(tools))},
            {"role": "user", "content": user_prompt},
            {"role": "assistant", "content": ""}
        ],
        "max_new_tokens": 100,
        "stream": False
    })
if (response.status_code!=200):
    raise Exception(response.text)

print(json.loads(response.text)['choices'][0]['message']['tool_calls'][0]['function']['name'])
print(json.loads(response.text)['choices'][0]['message']['tool_calls'][0]['function']['arguments'])
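
The two `print` calls above assume the model chose a tool. If it answers in plain text instead, `tool_calls` may be missing or `None` and the indexing would fail. The sketch below handles both cases. It assumes the message dictionary has the `tool_calls` and `content` keys used in this section, and `summarize_message` is a hypothetical helper name.

```python
import json


def summarize_message(message):
    """Return ('tool', name, arguments) or ('text', content, None) for a message dict."""
    tool_calls = message.get("tool_calls") or []
    if tool_calls:
        function = tool_calls[0]["function"]
        return ("tool", function["name"], json.loads(function["arguments"]))
    return ("text", message.get("content"), None)


# Stand-in messages shaped like the service's JSON response.
tool_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {"type": "function",
         "function": {"name": "gg", "arguments": '{"search_query": "current date"}'}}
    ],
}
text_message = {"role": "assistant", "content": "Once upon a time...", "tool_calls": None}

print(summarize_message(tool_message))
print(summarize_message(text_message))
```

With the real service, the input would be `response.json()['choices'][0]['message']`.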

