## **Text2Speech T2S API**

This tutorial uses ondewo-t2s-api to:

*   List the pipelines that can be used for synthesis
*   List the languages available for synthesis
*   List the available domains
*   Synthesize a text into audio
*   Synthesize a batch of texts into audios
*   Manipulate pipelines (Create, Delete, Update, Get)


The first step is to install the Ondewo Text to Speech client and the requirements needed to interact with the server.



In [1]:
import os
! pwd
! ls
! rm -rf ondewo-t2s-client-python
! git clone https://github.com/ondewo/ondewo-t2s-client-python.git
! cd ondewo-t2s-client-python && git checkout tags/3.1.4 -b 3.1.4 && pip install -r requirements.txt  && git status

/home/fcavallin/ondewo/ondewo-t2s
code_checks    Jenkinsfile		 ondewo-t2s-hifigan
config	       Jenkinsfile.release	 ondewo_t2s_with_certificate.ipynb
data	       jupyter_notebooks	 package
demo.Makefile  LICENSE.md		 README.md
demo_server    Makefile			 reqs.txt
docker	       models			 requirements.txt
evaluation     normalization		 setup.py
grpc_server    notebooks		 tests
images	       ondewo-t2s-client-python  training
inference      ondewo-t2s-glow		 utils
Cloning into 'ondewo-t2s-client-python'...
remote: Enumerating objects: 433, done.
remote: Counting objects: 100% (96/96), done.
remote: Compressing objects: 100% (65/65), done.
remote: Total 433 (delta 46), reused 63 (delta 29), pack-reused 337
Receiving objects: 100% (433/433), 497.33 KiB | 3.60 MiB/s, done.
Resolving deltas: 100% (240/240), done.
fatal: 'tags/3.1.4' is not a commit and a branch '3.1.4' cannot be created from it


In [2]:
os.getcwd()
! cd ondewo-t2s-client-python && python -m pip install .

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Processing /home/fcavallin/ondewo/ondewo-t2s/ondewo-t2s-client-python
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.


Building wheels for collected packages: ondewo-t2s-client
  Building wheel for ondewo-t2s-client (setup.py) ... done
  Created wheel for ondewo-t2s-client: filename=ondewo_t2s_client-4.0.1-py2.py3-none-any.whl size=16029 sha256=28e0cd954e0dd345bd2e9351282471906170eddec3b01bd6fecda263a53b09b7
  Stored in directory: /tmp/pip-ephem-wheel-cache-c27zmcel/wheels/ed/f8/c1/1a12133f38087de0ae4bd48fe6237afcd91b2889545f2c892f
Successfully built ondewo-t2s-client
Installing collected packages: ondewo-t2s-client
  Attempting uninstall: ondewo-t2s-client
    Found existing installation: ondewo-t2s-client 4.0.1
    Uninstalling ondewo-t2s-client-4.0.1:
      Successfully uninstalled ondewo-t2s-client-4.0.1
Successfully installed ondewo-t2s-client-4.0.1


In [3]:
import os
import io
import numpy
import soundfile as sf
from IPython.display import Audio
import grpc
from ondewo.t2s import text_to_speech_pb2, text_to_speech_pb2_grpc
from google.protobuf.json_format import ParseDict, MessageToDict, MessageToJson

Afterwards, the client's classes need to be imported.

In [4]:
from ondewo.t2s import text_to_speech_pb2
from ondewo.t2s.client.client import Client
from ondewo.t2s.client.client_config import ClientConfig
from ondewo.t2s.client.services.text_to_speech import Text2Speech
from ondewo.t2s.text_to_speech_pb2 import ListT2sPipelinesRequest, Text2SpeechConfig 
from ondewo.t2s.text_to_speech_pb2 import ListT2sLanguagesRequest, ListT2sDomainsRequest
from ondewo.t2s.text_to_speech_pb2 import T2sPipelineId, RequestConfig, SynthesizeRequest
from ondewo.t2s.text_to_speech_pb2 import BatchSynthesizeRequest

## Connect to the Text to Speech Service

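The example below shows how to configure the client and connect to the text to speech service. Here the connection uses an insecure channel (*use_secure_channel=False*), so *grpc_cert* is left as None. When setting *use_secure_channel=True*, a grpc certificate *grpc_cert* is required.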

In [5]:
# credentials = grpc.ssl_channel_credentials(root_certificates=cert)

MAX_MESSAGE_LENGTH: int = 60000000
GRPC_HOST: str = "localhost"
GRPC_PORT: str = "50555"
CHANNEL: str = f"{GRPC_HOST}:{GRPC_PORT}"
cert=None

options = [
    ('grpc.max_send_message_length', MAX_MESSAGE_LENGTH),
    ('grpc.max_receive_message_length', MAX_MESSAGE_LENGTH),
]

# channel = grpc.secure_channel(CHANNEL, credentials, options=options)

config: ClientConfig = ClientConfig(
  host=GRPC_HOST,
  port=GRPC_PORT, 
  grpc_cert=cert)
    
print(config)
    
client: Client = Client(config=config, use_secure_channel=False)



ClientConfig(host='localhost', port='50555', grpc_cert=None)
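
The rule stated above (a certificate is required whenever a secure channel is requested) can be checked before the client is created. The following is a minimal sketch in plain Python; `check_channel_settings` is a hypothetical helper written for this tutorial and is not part of the ondewo client.

```python
from typing import Optional


def check_channel_settings(use_secure_channel: bool, grpc_cert: Optional[str]) -> str:
    """Return the channel type, or raise if a secure channel is requested without a certificate."""
    if use_secure_channel and not grpc_cert:
        raise ValueError("use_secure_channel=True requires a grpc_cert")
    return "secure" if use_secure_channel else "insecure"


# Same settings as in the cell above: no certificate, insecure channel.
print(check_channel_settings(use_secure_channel=False, grpc_cert=None))
```

Calling the helper with `use_secure_channel=True` and `grpc_cert=None` raises a `ValueError`, which surfaces the misconfiguration before any request is sent.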


## Get the service information
The service information can be retrieved with the following method, which returns the release version of the service.

In [6]:
client.services.text_to_speech.get_service_info()

version: "3.1.3"
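
The outputs above show a service at version 3.1.3 and a client wheel at version 4.0.1. The tutorial does not say whether such a difference matters, so the sketch below is only a precaution: it compares the major version numbers and prints a note when they differ. `major_version` is a hypothetical helper, and the version strings are copied from the outputs above.

```python
def major_version(version: str) -> int:
    """Return the leading (major) component of a dotted version string."""
    return int(version.split(".")[0])


service_version = "3.1.3"  # value returned by get_service_info() above
client_version = "4.0.1"   # version of the client wheel installed earlier

if major_version(service_version) != major_version(client_version):
    print(f"Note: client {client_version} and service {service_version} differ in major version")
```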

## List all existing text to speech pipelines

All relevant configurations of the text to speech server are defined in a text to speech pipeline. A running server can store several such configurations at the same time, and the client chooses which one to use when sending a request to synthesize a text or a batch of texts.

The example below shows how to list all available pipelines by calling the *ListT2sPipelines* function, which takes a *ListT2sPipelinesRequest* as an argument and retrieves a *ListT2sPipelinesResponse*.

In [7]:
pipelines = client.services.text_to_speech.list_t2s_pipelines(request=ListT2sPipelinesRequest())
pipelines

pipelines {
  id: "brigitte"
  description {
    language: "en"
    speaker_sex: "female"
    pipeline_owner: "ondewo"
    comments: "trained on public domain dataset Lj_speech"
    speaker_name: "Brigitte"
    domain: "general"
  }
  active: true
  inference {
    type: "composite"
    composite_inference {
      text2mel {
        type: "glow_tts"
        glow_tts {
          batch_size: 5
          use_gpu: true
          length_scale: 1.0
          noise_scale: 0.6669999957084656
          path: "models/glow-tts/brigitte.pth"
          cleaners: "english_cleaners"
          param_config_path: "models/glow-tts/en/config.json"
        }
        glow_tts_triton {
          batch_size: 8
          length_scale: 1.0
          noise_scale: 0.6669999957084656
          cleaners: "english_cleaners"
          max_text_length: 100
          param_config_path: "models/glow-tts/en/config.json"
          triton_url: "localhost:50511"
          triton_model_name: "glow_tts"
        }
      }
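
The listing above can be narrowed down on the client side. The sketch below filters pipelines by the `active` flag and the `description.language` field, both of which appear in the printed output. `active_pipeline_ids` is a hypothetical helper, and `SimpleNamespace` objects stand in for the real response so that the example runs without a server.

```python
from types import SimpleNamespace


def active_pipeline_ids(pipelines, language):
    """Return the ids of active pipelines whose description matches the given language."""
    return [p.id for p in pipelines if p.active and p.description.language == language]


# Stand-in objects mimicking the fields printed above (id, active, description.language).
sample = [
    SimpleNamespace(id="brigitte", active=True, description=SimpleNamespace(language="en")),
    SimpleNamespace(id="inactive_voice", active=False, description=SimpleNamespace(language="en")),
]
print(active_pipeline_ids(sample, "en"))
```

In the notebook, the same helper could be called as `active_pipeline_ids(pipelines.pipelines, "en")`, assuming the response exposes the repeated `pipelines` field shown in the output.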
   

## List all possible synthesis languages

A running server can list all languages available for synthesis that satisfy the filter criteria given in the request.

The example below shows how to list all available languages by calling the *ListT2sLanguages* function, which takes a *ListT2sLanguagesRequest* as an argument and retrieves a *ListT2sLanguagesResponse*.


In [8]:
request = ListT2sLanguagesRequest(speaker_sexes=['female'])
response = client.services.text_to_speech.list_t2s_languages(request=request)
response

languages: "en"
languages: "de"

## List all possible domains

A running server can list all domains available for synthesis that satisfy the filter criteria given in the request.

The example below shows how to list all available domains by calling the *ListT2sDomains* function, which takes a *ListT2sDomainsRequest* as an argument and retrieves a *ListT2sDomainsResponse*.


In [9]:
request = ListT2sDomainsRequest(languages=['en'])
response = client.services.text_to_speech.list_t2s_domains(request=request)
response

domains: "general"

# Make a synthesize request to the server

The running server can synthesize a text into audio through the Synthesize method, which receives a SynthesizeRequest and returns a SynthesizeResponse.



The following pipelines were chosen as examples.
The brigitte pipeline is used for the English voice and a Glow-TTS/HiFi-GAN pipeline for the German voice. Therefore, a pipeline id object is created for each of the two pipelines.

In [10]:
english_pipeline = T2sPipelineId(id='brigitte')

In [11]:
german_pipeline = T2sPipelineId(id='glow_tts&hifi_gan-e976dd6c-2f41-484b-aec2-3e6868d37290')

The example below shows how to synthesize a text into an audio.
1.   A configuration has to be created with a *RequestConfig*, specifying the desired optional parameters.
2.   A request has to be created with a *SynthesizeRequest*, specifying the text to be synthesized and the previously created configuration.
3.   By calling the Synthesize method with the created request, the text is synthesized with the specified configuration.



The following example uses Brigitte's voice, so the synthesized audio has an English speaker's pronunciation.

In [12]:
config = RequestConfig(t2s_pipeline_id=english_pipeline.id, 
                       length_scale = 1.0)
request = SynthesizeRequest(text="Hi, this is Brigitte. Thanks for calling. I'm not here at the moment, so please leave a message and I'll call you back.", 
                            config=config)
response = client.services.text_to_speech.synthesize(request=request)
display(Audio(response.audio, autoplay=False))
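
Besides playing the audio inline, the bytes can be written to disk. The sketch below assumes that `response.audio` holds the bytes of a complete audio file (the notebook passes it directly to `Audio`, which suggests this, but the tutorial does not state it). `save_audio` is a hypothetical helper, and a WAV file of silence generated with the standard `wave` module stands in for the server response.

```python
import io
import tempfile
import wave
from pathlib import Path


def save_audio(audio: bytes, path: Path) -> int:
    """Write audio bytes to disk and return the number of bytes written."""
    return Path(path).write_bytes(audio)


# Stand-in for response.audio: one second of 16-bit mono silence as WAV bytes.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)       # 2 bytes = 16 bits per sample
    wav_file.setframerate(22050)   # hypothetical sample rate
    wav_file.writeframes(b"\x00\x00" * 22050)
fake_audio = buffer.getvalue()

target = Path(tempfile.gettempdir()) / "brigitte_greeting.wav"
written = save_audio(fake_audio, target)
print(f"wrote {written} bytes to {target}")
```

In the notebook, `save_audio(response.audio, target)` would store the synthesized greeting instead of the stand-in.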

### Length scale configuration
The length_scale attribute can be tuned to speed up or slow down the audio.
In the next example, length_scale is set to 0.5, so the returned audio is twice as fast as the original.

In [13]:
config = RequestConfig(t2s_pipeline_id=english_pipeline.id, 
                       length_scale = 0.5)
request = SynthesizeRequest(text="Hi, this is Brigitte. Thanks for calling. I'm not here at the moment, so please leave a message and I'll call you back.",
                            config=config)
response = client.services.text_to_speech.synthesize(request=request)
display(Audio(response.audio, autoplay=False))

Next, the same attribute is set to 2.0, so the returned audio is half as fast as the original.

In [14]:
config = RequestConfig(t2s_pipeline_id=english_pipeline.id, 
                       length_scale = 2.0)
request = SynthesizeRequest(text="Hi, this is Brigitte. Thanks for calling. I'm not here at the moment, so please leave a message and I'll call you back.",
                            config=config)
response = client.services.text_to_speech.synthesize(request=request)
display(Audio(response.audio, autoplay=False))
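
The effect of length_scale on the audio duration can be sketched with simple arithmetic. The example assumes that duration scales linearly with length_scale, which matches the "twice as fast" and "half as fast" descriptions above but is an approximation. The 6-second base duration and the helper `scaled_duration` are hypothetical.

```python
def scaled_duration(base_seconds: float, length_scale: float) -> float:
    """Estimate the audio duration for a given length_scale, assuming linear scaling."""
    if length_scale <= 0:
        raise ValueError("length_scale must be positive")
    return base_seconds * length_scale


base = 6.0  # hypothetical duration in seconds at length_scale = 1.0
for scale in (0.5, 1.0, 2.0):
    print(f"length_scale={scale}: about {scaled_duration(base, scale)} s")
```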

### Audio Format configuration

Another attribute that can be configured is audio_format.
It can be set to:

- 0 for wav
- 1 for flac
- 2 for caf (Core audio format)
- 3 for mp3
- 4 for aac (Advanced audio coding)
- 5 for ogg
- 6 for wma (Windows media audio)

In the following example, audio_format is set to 0 to create a wav audio file.

In [15]:
config = RequestConfig(t2s_pipeline_id=english_pipeline.id, 
                       audio_format = 0)
request = SynthesizeRequest(text="Hi, this is Brigitte. Thanks for calling. I'm not here at the moment, so please leave a message and I'll call you back.",
                            config=config)
response = client.services.text_to_speech.synthesize(request=request)
display(Audio(response.audio, autoplay=False))
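
Readable names are easier to work with than the bare integers from the list above. The sketch below mirrors that list in a local `IntEnum` and derives a file extension from it. This enum is a stand-in defined for illustration, not the enum shipped with the client, and `file_extension` is a hypothetical helper.

```python
from enum import IntEnum


class AudioFormat(IntEnum):
    """Local mirror of the audio_format values listed in this section."""
    WAV = 0
    FLAC = 1
    CAF = 2
    MP3 = 3
    AAC = 4
    OGG = 5
    WMA = 6


def file_extension(audio_format: int) -> str:
    """Return a file extension such as '.wav' for an audio_format value."""
    return "." + AudioFormat(audio_format).name.lower()


print(file_extension(0))
print(file_extension(AudioFormat.FLAC))
```

Because `IntEnum` members compare equal to integers, a value such as `AudioFormat.WAV` should be usable wherever the examples pass `0`, although that has not been verified against the client here.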

### Pulse Code Modulation

The pcm attribute selects the Pulse Code Modulation (PCM) sample encoding used for the audio file.
A PCM signal is a sequence of digital audio samples that carries the information needed to reconstruct the original analog signal.

- 0 for 16 (16  bits per sample)
- 1 for 24 (24  bits per sample)
- 2 for 32 (32  bits per sample)
- 3 for S8
- 4 for U8
- 5 for Float
- 6 for Double

The number of bits per sample affects the quality and size of the returned file. As the number of bits per sample increases, so do the size and quality of the file.

The first example generates the audio with 16 bits per sample (pcm = 0), and the second with unsigned 8-bit samples (pcm = 4), one of the two 8-bit options, which have the lowest resolution in the list.

In [16]:
config = RequestConfig(t2s_pipeline_id=english_pipeline.id, pcm = 0)
request = SynthesizeRequest(text="Hi, this is Brigitte. Thanks for calling. I'm not here at the moment, so please leave a message and I'll call you back.",
                            config=config)
response = client.services.text_to_speech.synthesize(request=request)
display(Audio(response.audio, autoplay=False))

In [17]:
config = RequestConfig(t2s_pipeline_id=english_pipeline.id, pcm = 4)
request = SynthesizeRequest(text="Hi, this is Brigitte. Thanks for calling. I'm not here at the moment, so please leave a message and I'll call you back.",
                            config=config)
response = client.services.text_to_speech.synthesize(request=request)
display(Audio(response.audio, autoplay=False))
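
The relationship between bits per sample and file size can be made concrete with a short calculation. The sketch below estimates the size of the raw sample data for each pcm value. It assumes that Float means 32-bit and Double means 64-bit samples, uses a hypothetical sample rate of 22050 Hz, and ignores file headers and compression. `raw_pcm_size` is a hypothetical helper.

```python
# Bytes per sample for each pcm value listed above (Float and Double widths are assumptions).
BYTES_PER_SAMPLE = {0: 2, 1: 3, 2: 4, 3: 1, 4: 1, 5: 4, 6: 8}


def raw_pcm_size(duration_seconds: float, sample_rate: int, pcm: int, channels: int = 1) -> int:
    """Estimate the size in bytes of uncompressed sample data, excluding headers."""
    samples = int(duration_seconds * sample_rate)
    return samples * BYTES_PER_SAMPLE[pcm] * channels


for pcm in (0, 4):
    print(f"pcm={pcm}: {raw_pcm_size(1.0, 22050, pcm)} bytes per second of mono audio")
```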

### SSML Tags

When a user sends a synthesize request, the user provides text that the Ondewo text-to-speech service converts to speech. Ondewo's voices automatically handle normal punctuation, such as pausing after a period.

However, in some cases you may want additional control over how the chosen voice generates the speech from the text in your request. For example, you may want the text to be spelled out as a code or id, or you may want a string of digits read back as a standard telephone number, email or url. SSML tags provide this type of control over the synthesis process.

SSML is a markup language that provides a standard way to mark up text for the generation of synthetic speech. Ondewo's voices support a subset of the tags defined in the SSML specification. The specific say-as interpretations supported are: email, phone, url, spell and spell-with-names.

See how SSML works: https://docs.aws.amazon.com/polly/latest/dg/ssml.html
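
Writing the say-as markup by hand is error-prone, so a small helper can build it. The sketch below is plain string formatting. `say_as` is a hypothetical helper, and the set of accepted values is taken from the interpret-as attributes used in the examples of this section.

```python
SUPPORTED_INTERPRETATIONS = {"phone", "email", "url", "spell", "spell-with-names"}


def say_as(content: str, interpret_as: str) -> str:
    """Wrap content in a say-as tag, rejecting interpretations not used in this tutorial."""
    if interpret_as not in SUPPORTED_INTERPRETATIONS:
        raise ValueError(f"unsupported interpret-as value: {interpret_as}")
    return f'<say-as interpret-as="{interpret_as}">{content}</say-as>'


text = "Please call " + say_as("+12354321", "phone") + " and I will call you back."
print(text)
```

The resulting string can be passed as the `text` argument of a SynthesizeRequest in the same way as the hand-written strings in the cells of this section.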

#### SSML Phone Tag

In [18]:
config = RequestConfig(t2s_pipeline_id=english_pipeline.id)
request = SynthesizeRequest(text='Hi, this is Brigitte. Thanks for calling. I am not here at the moment, so please leave a message to the following number <say-as interpret-as="phone">+12354321</say-as> and I will call you back.',
                            config=config)
response = client.services.text_to_speech.synthesize(request=request)
display(Audio(response.audio, autoplay=False))

#### SSML Email Tag

In [19]:
config = RequestConfig(t2s_pipeline_id=english_pipeline.id)
request = SynthesizeRequest(text='Hi, this is Brigitte. Thanks for calling. I am not here at the moment, so please send an email to the following address <say-as interpret-as="email">voices@ondewo.com.at</say-as>',
                            config=config)
response = client.services.text_to_speech.synthesize(request=request)
display(Audio(response.audio, autoplay=False))

#### SSML Url Tag

In [20]:
config = RequestConfig(t2s_pipeline_id=english_pipeline.id)
request = SynthesizeRequest(text='Hi, this is Brigitte. Thanks for calling. I am not here at the moment, so please visit the site <say-as interpret-as="url">https://ondewo.com/en/</say-as>',
                            config=config)
response = client.services.text_to_speech.synthesize(request=request)
display(Audio(response.audio, autoplay=False))

#### SSML Spell Tag

In [21]:
config = RequestConfig(t2s_pipeline_id=english_pipeline.id)
request = SynthesizeRequest(text='My reservation number is <say-as interpret-as="spell">ABC123DEF!</say-as>',
                            config=config)
response = client.services.text_to_speech.synthesize(request=request)
display(Audio(response.audio, autoplay=False))

#### SSML Spell With Names Tag

In [22]:
config = RequestConfig(t2s_pipeline_id=english_pipeline.id)
request = SynthesizeRequest(text='My reservation number is <say-as interpret-as="spell-with-names">ABC123DEF!</say-as>',
                            config=config)
response = client.services.text_to_speech.synthesize(request=request)
display(Audio(response.audio, autoplay=False))

### ARPABET Phonemes

Most languages, including English, can be described in terms of a set of distinctive sounds, or phonemes. In particular, for American English, there are about 42 phonemes including vowels, diphthongs, semi-vowels and consonants. 

The international standard for representing phonemes is the International Phonetic Alphabet (IPA). To enable computer representation of the phonemes, it is convenient to encode them as ASCII characters, which can be done with the ARPABET scheme. In the text to be synthesized, a phoneme sequence is written inside curly braces, and the digit after a vowel marks its stress (0 for none, 1 for primary, 2 for secondary).

For example:

- Bee : {B IY1}
- She : {SH IY1}
- Red : {R EH1 D}
- Sofa : {S OW1 F AH}

In [23]:
config = RequestConfig(t2s_pipeline_id=english_pipeline.id)
request = SynthesizeRequest(text='Hello I am {AE2 L EH0 G Z AE1 N D R AH0}',
                            config=config)
response = client.services.text_to_speech.synthesize(request=request)
display(Audio(response.audio, autoplay=False))
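
Phoneme strings like the one above can be assembled and sanity-checked in code. The sketch below joins ARPABET symbols into the curly-brace notation used in this section. The regular expression only checks the shape of each symbol (one or two capital letters with an optional stress digit), not whether it belongs to the ARPABET inventory, and `arpabet` is a hypothetical helper.

```python
import re
from typing import Sequence

# Shape check only: one or two capital letters, optionally followed by a stress digit 0-2.
_PHONE_SHAPE = re.compile(r"[A-Z]{1,2}[0-2]?")


def arpabet(phones: Sequence[str]) -> str:
    """Join ARPABET symbols into the curly-brace notation used in the examples above."""
    for phone in phones:
        if not _PHONE_SHAPE.fullmatch(phone):
            raise ValueError(f"not an ARPABET-shaped symbol: {phone!r}")
    return "{" + " ".join(phones) + "}"


print("Hello I am " + arpabet(["R", "EH1", "D"]))
```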

# Make a synthesize request to the server for a batch of texts

The running server can synthesize a batch of texts into audios through the BatchSynthesize method, which receives a BatchSynthesizeRequest and returns a BatchSynthesizeResponse.

The example below shows how to synthesize a batch of texts into audios.

1.   A RequestConfig has to be created for each text in the batch, specifying the desired optional parameters.
2.   A SynthesizeRequest has to be created for each text in the batch, specifying the text to be synthesized and its configuration.
3.   The requests are collected in a BatchSynthesizeRequest, and calling BatchSynthesize with it synthesizes each text with its specified configuration.

The English pipeline defined earlier is used for both texts in this example.

In [24]:
config_1 = RequestConfig(t2s_pipeline_id=english_pipeline.id, length_scale = 1.0, pcm=0, audio_format= 0)
config_2 = RequestConfig(t2s_pipeline_id=english_pipeline.id, length_scale = 0.5, pcm=0, audio_format= 1)

request_1 = SynthesizeRequest(text='Hello', config=config_1)
request_2 = SynthesizeRequest(text='How are you?', config=config_2)

request = BatchSynthesizeRequest(batch_request = [request_1, request_2])

response = client.services.text_to_speech.batch_synthesize(request = request)

In [25]:

for audio_message in response.batch_response:
  display(Audio(audio_message.audio, autoplay=False))
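
For larger batches, the per-text configuration can be kept in plain data and turned into requests in a loop. The sketch below takes the config and request constructors as parameters, so it runs here with `dict` as a stand-in for both. `build_requests` and the `specs` layout are hypothetical and only illustrate the three steps described above.

```python
from typing import Any, Callable, Dict, List


def build_requests(
    specs: List[Dict[str, Any]],
    make_config: Callable[..., Any],
    make_request: Callable[..., Any],
) -> List[Any]:
    """Create one request per spec; every key except 'text' is passed to the config."""
    requests = []
    for spec in specs:
        options = {key: value for key, value in spec.items() if key != "text"}
        requests.append(make_request(text=spec["text"], config=make_config(**options)))
    return requests


specs = [
    {"text": "Hello", "length_scale": 1.0, "pcm": 0, "audio_format": 0},
    {"text": "How are you?", "length_scale": 0.5, "pcm": 0, "audio_format": 1},
]

# dict stands in for RequestConfig and SynthesizeRequest so the sketch runs without the client.
requests = build_requests(specs, make_config=dict, make_request=dict)
print(requests[1])
```

In the notebook, passing `functools.partial(RequestConfig, t2s_pipeline_id=english_pipeline.id)` as `make_config` and `SynthesizeRequest` as `make_request` should produce a list suitable for `BatchSynthesizeRequest(batch_request=...)`.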

## Get Pipeline

To get a specific pipeline configuration, the GetT2sPipeline method is used. This method receives a T2sPipelineId and returns a Text2SpeechConfig.

In [26]:
request = T2sPipelineId(id=german_pipeline.id)
pipeline_config = client.services.text_to_speech.get_t2s_pipeline(request=request)

## Create Pipeline

The server provides a method for creating new pipelines. This can be done with the function CreateT2sPipeline, which receives a Text2SpeechConfig and retrieves a T2sPipelineId.

In [27]:
new_inference_config = pipeline_config.inference
new_inference_config.composite_inference.text2mel.glow_tts.length_scale = 2
request = Text2SpeechConfig(id='brigitte2.0',                           # Pipeline Id
                            description=pipeline_config.description,      # Pipeline Description 
                            active=True,                                  # Pipeline is active or not
                            inference=new_inference_config,               # Pipeline Inference configuration
                            normalization=pipeline_config.normalization,  # Pipeline Normalization Parameters
                            postprocessing=pipeline_config.postprocessing)# Pipeline Postprocessing
new_pipeline_id = client.services.text_to_speech.create_t2s_pipeline(request=request)
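
One detail of the cell above is worth noting: `new_inference_config = pipeline_config.inference` binds a second name to the same object rather than creating a copy, so changing `length_scale` through it also changes `pipeline_config`. The sketch below demonstrates the difference between an alias and a deep copy using `SimpleNamespace` stand-ins. Protobuf messages are expected to behave the same way for sub-message access and offer `CopyFrom` for copying, but that should be verified against the installed protobuf version.

```python
import copy
from types import SimpleNamespace

# Stand-in for a retrieved pipeline configuration.
original = SimpleNamespace(inference=SimpleNamespace(length_scale=1.0))

# An alias shares state with the original object.
alias = original.inference
alias.length_scale = 2.0
print(original.inference.length_scale)  # the original changed as well

# A deep copy is independent of the original.
original.inference.length_scale = 1.0
independent = copy.deepcopy(original.inference)
independent.length_scale = 2.0
print(original.inference.length_scale, independent.length_scale)
```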

## Update Pipeline

The server provides a method to update a pipeline, UpdateT2sPipeline, which receives a pipeline configuration (Text2SpeechConfig).

In the following example, the configuration retrieved earlier with GetT2sPipeline is modified and used to update the pipeline.

In [28]:
pipeline_config.inference.composite_inference.text2mel.glow_tts.length_scale = 2

In [29]:
client.services.text_to_speech.update_t2s_pipeline(request=pipeline_config)



## Delete Pipeline

A pipeline can be deleted with the method DeleteT2sPipeline, receiving a pipeline id.

In [30]:
request = T2sPipelineId(id='brigitte2.0')
pipeline_config = client.services.text_to_speech.get_t2s_pipeline(request=request)
client.services.text_to_speech.delete_t2s_pipeline(request=request)

