# Text-To-Speech inference with my funny API

This notebook shows how to use the `tts` module as an API to run complete inference with your pretrained `Tacotron-2` model as the `synthesizer` and NVIDIA's pretrained `WaveGlow` as the `vocoder`.

The API is really easy to use! Just call the function with your sentence(s) and the model you want to use, and that's it! The code automatically loads the right model and runs the whole inference.

You can also associate models with languages in the `models/tts/__init__.py` file to enable language-based loading instead of model-based loading (see the 2nd cell for an example).

Note that the `Vocoder` is loaded as a global variable and every `BaseModel` is a `singleton`, so you can call the function as many times as you want without reloading the models each time!
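To illustrate why repeated calls do not reload anything, here is a minimal sketch of a "one instance per model name" pattern. It is not the project's actual `BaseModel` code: the `DemoModel` class, its `restore_calls` counter and the `_instances` registry are hypothetical names used only for this illustration.

```python
class DemoModel:
    """Hypothetical stand-in for a model class that keeps one instance per name."""
    _instances = {}    # registry: (class, name) -> instance
    restore_calls = 0  # counts how many times the expensive restoration ran

    def __new__(cls, name):
        key = (cls, name)
        if key not in cls._instances:
            instance = super().__new__(cls)
            instance._initialized = False
            cls._instances[key] = instance
        return cls._instances[key]

    def __init__(self, name):
        if self._initialized:
            return  # already restored: skip the expensive part
        self.name = name
        DemoModel.restore_calls += 1  # stands in for restoring weights from disk
        self._initialized = True


a = DemoModel('nvidia_pretrained')
b = DemoModel('nvidia_pretrained')  # returns the same object, no second restoration
print(a is b, DemoModel.restore_calls)
```

The second construction returns the object created by the first one, so the restoration step only runs once per name.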

I left the results so that you can listen to them without re-executing the code ;) Samples are also in the `example_outputs/audios/` folder.

PS: I will (*theoretically*) **never** share the weights for the French demonstration in this notebook, but I will share other French voices trained on `SIWIS` (single-speaker) and `SIWIS + CommonVoice + VoxForge` (multi-speaker) [1].

Note: it is normal for the 1st generation to be relatively slow; the next ones will be faster than real time!

## Steps to reproduce

1. Download model weights (see README.md for links)
2. Unzip the weights in the `pretrained_models/` directory
3. If you downloaded a French model: go to `models/tts/__init__.py` and update the `pretrained` variable by adding `"fr"` as key and the model name as value. This associates the model with the language

Note: the last step is optional, but if you skip it you **must** load the model by its name (and not by its language) ;)
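The association from step 3 can be sketched as a plain dictionary lookup. This is only an illustration under assumptions: the `en` entry comes from this notebook, while `my_french_model` is a placeholder name and `resolve_model_name` is a hypothetical helper, not a function of the `tts` module.

```python
# Hypothetical sketch of the language -> model association.
# 'en' -> 'nvidia_pretrained' is the default mentioned in this notebook;
# 'my_french_model' is a placeholder for the name of the model you downloaded.
pretrained = {
    'en': 'nvidia_pretrained',
    'fr': 'my_french_model',
}


def resolve_model_name(model=None, lang=None):
    """Pick a model name: an explicit name wins, otherwise look up the language."""
    if model is not None:
        return model
    if lang not in pretrained:
        raise ValueError(
            f'No model associated with language {lang!r}: pass the model name instead'
        )
    return pretrained[lang]


print(resolve_model_name(lang='en'))
print(resolve_model_name(model='nvidia_pretrained'))
```

With this kind of lookup, a language without an entry can only be used by passing the model name explicitly, which matches the note above.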

### Streaming

This functionality lets you enter text interactively and get the audio output right away.

In [1]:
from models.tts import tts_stream

# I suggest not saving audio files when streaming because it is slower ;)
# I saved them here so that you can listen to the result

# PS: the 'Goodbye, see you soon !' message at the end is a funny sentence I added,
# played when you stop the streaming :D
tts_stream(model = 'nvidia_pretrained', directory = 'example_outputs', overwrite = True)


Model restoration...
Initializing submodel : tts_model !
Successfully restored tts_model from pretrained_models/nvidia_pretrained/saving/tts_model.json !
Model nvidia_pretrained initialized successfully !


Enter text to read : Hello World ! This is a demonstration of my funny streaming API for Text To Speech !


Text : Hello World ! This is a demonstration of my funny streaming API for Text To Speech !



Enter text to read : 


Text : Goodbye, see you soon !
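The session above (prompt, read a sentence, stop on an empty line, then say goodbye) can be sketched as a simple loop. This is an assumption about the general shape of such a loop, not the actual `tts_stream` implementation: `stream_loop`, `synthesize` and `read_line` are hypothetical names, and `synthesize` stands in for the real inference call.

```python
def stream_loop(synthesize, read_line=input, farewell='Goodbye, see you soon !'):
    """Read sentences until an empty line, synthesize each one, then say goodbye."""
    spoken = []
    while True:
        text = read_line('Enter text to read : ').strip()
        if not text:
            break  # an empty input stops the streaming
        synthesize(text)
        spoken.append(text)
    synthesize(farewell)
    spoken.append(farewell)
    return spoken


# Non-interactive demo: scripted inputs replace the keyboard,
# and a list collects what would have been synthesized.
scripted = iter(['Hello World !', ''])
log = []
stream_loop(log.append, read_line=lambda prompt: next(scripted))
print(log)
```

Passing the input function as an argument keeps the loop testable without a keyboard; with the default `read_line=input` it behaves interactively.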



### TTS on text

This function generates audio from the provided text or list of texts.

By default, sentences that were already generated are not regenerated; pass `overwrite = True` to force regeneration.
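The skip-unless-`overwrite` behavior can be pictured with a small cache keyed by sentence. This is a sketch under assumptions, not the module's real bookkeeping: `generate_all`, `cache` and `fake_synthesize` are hypothetical names for illustration.

```python
def generate_all(sentences, synthesize, cache, overwrite=False):
    """Synthesize each sentence, reusing cached results unless `overwrite` is True."""
    results = []
    for sentence in sentences:
        if overwrite or sentence not in cache:
            cache[sentence] = synthesize(sentence)
        results.append((sentence, cache[sentence]))
    return results


calls = []

def fake_synthesize(sentence):
    # Stand-in for the real inference: records the call and returns a dummy result
    calls.append(sentence)
    return {'length': len(sentence)}


cache = {}
generate_all(['Hello world !'], fake_synthesize, cache)
generate_all(['Hello world !'], fake_synthesize, cache)                  # reused
generate_all(['Hello world !'], fake_synthesize, cache, overwrite=True)  # regenerated
print(len(calls))
```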

In this example, the model is loaded with `lang = 'en'`, which loads the `nvidia_pretrained` model: in the `models/tts/__init__.py` file, I associated the `en` language with this model by default.

Note on performance: in the `example_waveglow` notebook you can see `1.4 sec generated / sec` with the `tensorflow` implementation of `WaveGlow`. The `pytorch` version achieves `1.7 sec generated / sec`. Here it achieves `1.4 sec generated / sec` because I save the results (which takes some time).
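For reference, the `sec generated / sec` figure is simply the audio duration divided by the wall-clock time, so any value above 1 means faster than real time. A minimal sketch, where `generation_speed` is a hypothetical helper and the two numbers are taken from the log of the English run in this notebook:

```python
def generation_speed(audio_seconds, wall_seconds):
    """Seconds of audio produced per second of wall-clock time (> 1: faster than real time)."""
    if wall_seconds <= 0:
        raise ValueError('wall_seconds must be positive')
    return audio_seconds / wall_seconds


# Values reported by the English run: 10.494 sec generated in 7.365 sec
speed = generation_speed(10.494, 7.365)
print(f'{speed:.3f} sec generated / sec')
```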

In [3]:
from models.tts import tts

text = [
    "Hello world ! Hope you will enjoy this funny API for Text-To-Speech !",
    "If you train new models, do not hesitate to contact me or add it in the available models !"
]

tts(
    text, lang = 'en', directory = 'example_outputs', display = True, 
    overwrite = True, debug = True, tqdm = lambda x: x
)

Total time : 1.962 sec
- Processing time : 0.003 sec
- Inference time : 1.360 sec
- Saving time : 0.599 sec
Text : Hello world ! Hope you will enjoy this funny API for Text-To-Speech !



Text : If you train new models, do not hesitate to contact me or add it in the available models !



10.494 sec generated in 7.365 sec (1.425 sec generated / sec)


[('Hello world ! Hope you will enjoy this funny API for Text-To-Speech !',
  {'audio': 'example_outputs/audios/audio_2.mp3',
   'duree': 4.341179138321995,
   'mels': ['example_outputs/mels/mel_3.npy'],
   'plots': ['example_outputs/plots/plot_3.png'],
   'splitted': ['Hello world ! Hope you will enjoy this funny API for Text-To-Speech !']}),
 ('If you train new models, do not hesitate to contact me or add it in the available models !',
  {'audio': 'example_outputs/audios/audio_3.mp3',
   'duree': 6.15233560090703,
   'mels': ['example_outputs/mels/mel_2.npy'],
   'plots': ['example_outputs/plots/plot_2.png'],
   'splitted': ['If you train new models, do not hesitate to contact me or add it in the available models !']})]
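The returned value is a list of `(text, infos)` pairs, so it can be post-processed with ordinary Python. The sketch below copies only the `audio` and `duree` entries from the output above (the other keys are left out for brevity) and sums the durations; the variable names are illustrative.

```python
# Subset of the result printed above: (text, infos) pairs with two of the keys
results = [
    ('Hello world ! Hope you will enjoy this funny API for Text-To-Speech !',
     {'audio': 'example_outputs/audios/audio_2.mp3', 'duree': 4.341179138321995}),
    ('If you train new models, do not hesitate to contact me or add it in the available models !',
     {'audio': 'example_outputs/audios/audio_3.mp3', 'duree': 6.15233560090703}),
]

for text, infos in results:
    # 'duree' is the audio duration in seconds
    print(f"{infos['duree']:.3f} sec -> {infos['audio']}")

total = sum(infos['duree'] for _, infos in results)
print(f'{total:.3f} sec generated')
```

The total matches the `10.494 sec generated` line of the log.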

In [2]:
from models.tts import tts

text = [
    "Bonjour à tous ! J'espère que vous allez aimer cette démonstration de voix en français !"
]

tts(
    text, lang = 'fr', directory = 'example_outputs', display = True, 
    overwrite = True, debug = True, tqdm = lambda x: x
)

Total time : 1.230 sec
- Processing time : 0.001 sec
- Inference time : 0.928 sec
- Saving time : 0.301 sec
Texte : Bonjour à tous ! J'espère que vous allez aimer cette démonstration de voix en français !



4.225 sec generated in 3.390 sec (1.246 sec generated / sec)


[("Bonjour à tous ! J'espère que vous allez aimer cette démonstration de voix en français !",
  {'audio': 'example_outputs/audios/audio_4.mp3',
   'duree': 4.225079365079365,
   'mel_files': ['example_outputs/mels/mel_4.npy'],
   'plot_files': ['example_outputs/plots/plot_4.png'],
   'splitted': ["Bonjour à tous ! J'espère que vous allez aimer cette démonstration de voix en français !"]})]