# Gai/Gen: Text-to-Speech (TTS)

## 1. Note

The examples below were tested in the following environment:

-   NVIDIA GeForce RTX 2060 6GB
-   Windows 11 + WSL2
-   Ubuntu 22.04
-   Python 3.10
-   CUDA Toolkit 11.8
-   openai 1.6.1
-   TTS 0.22.0
-   deepspeed 0.12.6


## 2. Create Virtual Environment and Install Dependencies

We will create a separate virtual environment for this, to avoid conflicts between the dependencies that each underlying model requires.

```sh
sudo apt update -y && sudo apt install ffmpeg git git-lfs -y
conda create -n TTS python=3.10.10 -y
conda activate TTS
pip install ".[TTS]"
```
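To confirm that the new environment matches the versions listed in section 1, a short check along the following lines may help. This is only a sketch: the package names are taken from the list in section 1, and `installed_version` is a hypothetical helper, not part of gai.

```python
# check_env.py (hypothetical helper script, not part of gai)
import shutil
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Package names as listed in the tested environment.
    for package in ("openai", "TTS", "deepspeed"):
        print(f"{package}: {installed_version(package) or 'not installed'}")
    # ffmpeg is installed with apt in the first setup command.
    print("ffmpeg:", shutil.which("ffmpeg") or "not found on PATH")
```

Run it inside the activated `TTS` environment so that it reports the packages installed there.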



## 3. Install Model

The following demo uses Coqui AI's xTTS model. Create and run the following script, `xtts_download.py`, to download the model:

```python
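# xtts_download.py
import os

# Accept the Coqui terms of service non-interactively.
os.environ["COQUI_TOS_AGREED"] = "1"

from TTS.utils.manage import ModelManager

# Expand "~" explicitly: Python does not expand it in plain string paths.
output_prefix = os.path.expanduser("~/gai/models/tts")
model_name = "tts_models/multilingual/multi-dataset/xtts_v2"

print("Downloading...")
mm = ModelManager(output_prefix=output_prefix)
mm.download_model(model_name)
print("Downloaded")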
```
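To check that the download succeeded, the model directory can be listed afterwards. The sketch below assumes the files end up somewhere under `~/gai/models/tts`; the exact subfolder layout is decided by `ModelManager` and is not assumed here. `list_files` is a hypothetical helper, not part of gai or TTS.

```python
# list_model_files.py (hypothetical helper, not part of gai or TTS)
from pathlib import Path


def list_files(root):
    """Return (relative path, size in MB) for every file under root, sorted by path."""
    root = Path(root).expanduser()
    if not root.is_dir():
        return []
    files = sorted(p for p in root.rglob("*") if p.is_file())
    return [(str(p.relative_to(root)), p.stat().st_size / 1e6) for p in files]


if __name__ == "__main__":
    # Assumed location: the output_prefix passed to ModelManager.
    for name, size_mb in list_files("~/gai/models/tts"):
        print(f"{size_mb:10.1f} MB  {name}")
```

An empty listing suggests that the download did not complete or that the files were written to a different location.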

Note that loading the model for the first time takes a while, because DeepSpeed has to compile it first.

## 4. Examples

### 4.1 OpenAI Text-to-Speech

```python
print("GENERATING:")
from gai.gen import Gaigen
# Load the OpenAI TTS generator by name.
gen = Gaigen.GetInstance().load('openai-tts-1')
response = gen.create(
  voice="alloy",
  input="The definition of insanity is doing the same thing over and over and expecting different results."
)
# Play the result inline in the notebook.
from IPython.display import Audio
Audio(response, rate=24000)
```

Output:

```
GENERATING:
Loading TTS...
```


### 4.2 Coqui xTTS Text-to-Speech

```python
print("GENERATING:")
from gai.gen import Gaigen
# Load the local Coqui xTTS generator by name.
gen = Gaigen.GetInstance().load('xtts-2')
response = gen.create(
  voice="Vjollca Johnnie",
  input="The definition of insanity is doing the same thing over and over and expecting different results."
)
# Play the result inline in the notebook.
from IPython.display import Audio
Audio(response, rate=24000)
```

Output:

```
GENERATING:
Loading TTS...
Loading XTTS...
```


Loading the model also prints DeepSpeed log output similar to the following:

```
[2024-01-18 17:38:10,397] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-18 17:38:11,214] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.12.6, git-hash=unknown, git-branch=unknown
[2024-01-18 17:38:11,217] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Using /home/roylai/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/roylai/.cache/torch_extensions/py310_cu121/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Time to load transformer_inference op: 0.11158943176269531 seconds
[2024-01-18 17:38:12,245] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 1024, 'intermediate_size': 4096, 'heads': 16, 'num_hidden_layers': -1, 'dtype': torch.float32, 'pre_layer_norm': True, 'norm_type': <NormType.LayerNorm: 1>, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 1024, 'min_out_tokens': 1, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False, 'set_empty_params': False, 'transposed_mode':
Loading extension module transformer_inference...
------------------------------------------------------
Free memory : 4.624023 (GigaBytes)
Total memory: 7.999573 (GigaBytes)
Requested memory: 0.335938 (GigaBytes)
Setting maximum total tokens (input + output) to 1024
WorkSpace: 0x79d000000
------------------------------------------------------
```


## 5. Running as a Service

#### Step 1: Start Docker container

```bash
docker run -d \
    --name gai-tts \
    -p 12031:12031 \
    --gpus all \
    -v ~/gai/models:/app/models \
    kakkoii1337/gai-tts:latest
```

#### Step 2: Wait for model to load

```bash
docker logs gai-tts
```

When loading is complete, the logs should show the following:

```bash
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:12031 (Press CTRL+C to quit)
```
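Instead of re-running `docker logs` by hand, a script can poll until the service accepts connections. This is a minimal sketch, assuming the container publishes port 12031 on localhost as in Step 1. `wait_for_port` is a hypothetical helper, not part of gai. An open port only shows that something is listening; depending on Docker's networking setup it may open before the model has finished loading, so the log lines above remain the authoritative check.

```python
# wait_for_tts.py (hypothetical helper, not part of gai)
import socket
import time


def wait_for_port(host, port, timeout=300.0, interval=2.0):
    """Poll until a TCP connection to host:port succeeds or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    ready = wait_for_port("localhost", 12031)
    print("service is accepting connections" if ready else "timed out waiting for service")
```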

#### Step 3: Test


```bash
curl -X POST http://localhost:12031/gen/v1/audio/speech \
    -H "Content-Type: application/json" \
    -N \
    -d "{\"model\":\"xtts-2\",\"input\":\"I think there is no direct bus. You can take 185 and change to MRT at buona vista. 185 should be arriving in 5 minutes.\", \"stream\":true}" | ffplay -autoexit -nodisp -hide_banner -
```
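The same request can also be sent from Python. The sketch below uses only the standard library and mirrors the endpoint, headers, and JSON fields of the curl command. `build_request` and `save_speech` are hypothetical helpers, the output file name is arbitrary, and the code assumes that the response body is the raw audio stream, as the pipe into `ffplay` suggests.

```python
# tts_client.py (hypothetical client script, not part of gai)
import json
import urllib.request

URL = "http://localhost:12031/gen/v1/audio/speech"  # endpoint from the curl command


def build_request(text, model="xtts-2", stream=True):
    """Build the same JSON POST request that the curl command sends."""
    payload = {"model": model, "input": text, "stream": stream}
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_speech(text, path, chunk_size=4096):
    """Send the request and write the response body to path as it arrives."""
    with urllib.request.urlopen(build_request(text)) as response, open(path, "wb") as out:
        while chunk := response.read(chunk_size):
            out.write(chunk)


if __name__ == "__main__":
    # The audio container format is not stated in this document, so the file
    # gets a neutral extension and ffplay is left to detect the format.
    save_speech("185 should be arriving in 5 minutes.", "speech_output.bin")
```

The saved file can then be played with `ffplay -autoexit -nodisp speech_output.bin`.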
