## **Text2Speech T2S API**

This tutorial uses ondewo-t2s-api to:

*   List the possible pipelines that can be used for synthesizing
*   List the possible languages that can be used in the synthesize process
*   List the possible domains
*   Synthesize a text to audio
*   Synthesize a batch of texts to audios
*   Manipulate pipelines (Create, Delete, Update, Get)



In [1]:
import os
! cd .. && python -m pip install .

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Processing /home/fcavallin/ondewo/ondewo-t2s-client-python
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.

Building wheels for collected packages: ondewo-t2s-client
  Building wheel for ondewo-t2s-client (setup.py) ... done
  Created wheel for ondewo-t2s-client: filename=ondewo_t2s_client-3.0.0-py2.py3-none-any.whl size=22002 sha256=d16169f5e5476b40eb4c840bf7769da19fcd96293e15fbafbde241e5d16eb30e
  Stored in directory: /tmp/pip-ephem-wheel-cache-t2fzt9hm/wheels/5e/5c/15/a95814f8cdadbab7525b51c1ac726b1f38913f1fda48abdc7e
Successfully built ondewo-t2s-client
Installing collected packages: ondewo-t2s-client
Successfully installed ondewo-t2s-client-3.0.0


In [7]:
import os
import io
import soundfile as sf
import IPython.display as ipd
import grpc
from ondewo.t2s import text_to_speech_pb2, text_to_speech_pb2_grpc
import google.protobuf.empty_pb2 as empty_pb2
from google.protobuf.json_format import ParseDict, MessageToDict, MessageToJson
from utils import play

## Connect to the Text to Speech Service

The example below creates a gRPC channel and a text-to-speech stub object. The cell connects over an insecure channel; to use a secure channel instead, switch to the commented-out *grpc.secure_channel* line, which requires a gRPC certificate *grpc_cert*.

In [8]:
MAX_MESSAGE_LENGTH: int = 60000000
GRPC_HOST: str = "dgxstation" #"<ADD GRPC SERVER HERE>"
GRPC_PORT: str = "50557" #"<ADD GRPC PORT HERE>"
CHANNEL: str = f"{GRPC_HOST}:{GRPC_PORT}"
grpc_cert: str = None #"<ADD CERTIFICATE HERE>"
credentials = grpc.ssl_channel_credentials(root_certificates=grpc_cert)

options = [
    ('grpc.max_send_message_length', MAX_MESSAGE_LENGTH),
    ('grpc.max_receive_message_length', MAX_MESSAGE_LENGTH),
]


# channel = grpc.secure_channel(CHANNEL, credentials=credentials, options=options)
channel = grpc.insecure_channel(CHANNEL, options=options)

stub = text_to_speech_pb2_grpc.Text2SpeechStub(channel=channel)
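
The choice between a secure and an insecure channel can be wrapped in a small helper. The following is a minimal sketch, not part of the ondewo client: the function names `build_channel_options` and `create_channel` are hypothetical, and the flag name `use_secure_channel` is borrowed from the description above. `grpc` is imported inside the function so that the option helper can be used on its own.

```python
from typing import List, Optional, Tuple

MAX_MESSAGE_LENGTH: int = 60000000


def build_channel_options(max_length: int = MAX_MESSAGE_LENGTH) -> List[Tuple[str, int]]:
    """Return the gRPC options that raise the send/receive message size limits."""
    return [
        ("grpc.max_send_message_length", max_length),
        ("grpc.max_receive_message_length", max_length),
    ]


def create_channel(
    host: str,
    port: str,
    grpc_cert: Optional[bytes] = None,
    use_secure_channel: bool = False,
):
    """Open a secure channel when requested, otherwise an insecure one (hypothetical helper)."""
    import grpc  # imported lazily so build_channel_options works without grpc installed

    target = f"{host}:{port}"
    options = build_channel_options()
    if use_secure_channel:
        # root_certificates expects PEM-encoded bytes; None falls back to the default roots
        credentials = grpc.ssl_channel_credentials(root_certificates=grpc_cert)
        return grpc.secure_channel(target, credentials, options=options)
    return grpc.insecure_channel(target, options=options)
```

With this helper, `create_channel("dgxstation", "50557")` would open the same insecure channel as the cell above.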


## Get the service information
To get the service information, call the *GetServiceInfo* method. It returns the release version of the service.

In [4]:
stub.GetServiceInfo(empty_pb2.Empty())

version: "1.5.0"

## List all existing text to speech pipelines

All relevant configurations of the text to speech server are defined in a text to speech pipeline. A running server can store several such configurations at the same time, and the client chooses which one to use when sending a request to synthesize a text or a batch of texts.

The example below shows how to list all available pipelines by calling the *ListT2sPipelines* function, which takes a *ListT2sPipelinesRequest* as an argument and retrieves a *ListT2sPipelinesResponse*.

In [9]:
pipelines = stub.ListT2sPipelines(request=text_to_speech_pb2.ListT2sPipelinesRequest()).pipelines
pipelines

[id: "linda"
description {
  language: "en"
  speaker_sex: "female"
  pipeline_owner: "ondewo"
  comments: "trained on public domain dataset"
  speaker_name: "Linda"
  domain: "general"
}
active: true
inference {
  type: "composite"
  composite_inference {
    text2mel {
      type: "glow_tts"
      glow_tts {
        batch_size: 5
        use_gpu: true
        length_scale: 1.0
        noise_scale: 0.6669999957084656
        path: "models/glow-tts/linda_blank.pth"
        param_config_path: "models/glow-tts/config_blank_en.json"
      }
      glow_tts_triton {
        batch_size: 8
        length_scale: 1.0
        noise_scale: 0.6669999957084656
        max_text_length: 100
        param_config_path: "models/ondewo/de-DE/general001/kk001/0.0.1/glow-tts/de/config_blank.json"
        triton_url: "localhost:50511"
        triton_model_name: "glow_tts"
      }
    }
    mel2audio {
      type: "hifi_gan"
      mb_melgan_triton {
        config_path: "models/ondewo/de-DE/general001/tm001/

## List all possible synthesis languages

A running server can list all languages available for synthesis that fulfil the specified requirements, for example a given speaker sex.

The example below shows how to list all available languages by calling the *ListT2sLanguages* function, which takes a *ListT2sLanguagesRequest* as an argument and retrieves a *ListT2sLanguagesResponse*.


In [10]:
request = text_to_speech_pb2.ListT2sLanguagesRequest(speaker_sexes=['female'])
response = stub.ListT2sLanguages(request=request)
response

languages: "de"
languages: "en"

## List all possible domains

A running server can list all domains available for synthesis that fulfil the specified requirements, for example a given language.

The example below shows how to list all available domains by calling the *ListT2sDomains* function, which takes a *ListT2sDomainsRequest* as an argument and retrieves a *ListT2sDomainsResponse*.


In [11]:
request = text_to_speech_pb2.ListT2sDomainsRequest(languages=['en'])
response = stub.ListT2sDomains(request=request)
response

domains: "general"

## Make a synthesize request to the server

The running server can synthesize a text into audio. To do so, call the *Synthesize* method, which receives a *SynthesizeRequest* and returns a *SynthesizeResponse*.



The following pipelines were chosen as examples.
Linda's voice is used for English and Alexandra's voice for German. Therefore, a *T2sPipelineId* is created for each of the two pipelines.

In [12]:
english_pipeline = text_to_speech_pb2.T2sPipelineId(id='linda')
english_pipeline

id: "linda"

In [13]:
german_pipeline = text_to_speech_pb2.T2sPipelineId(id='alexandra')
german_pipeline

id: "alexandra"

The example below shows how to synthesize a text into audio.
1.   A configuration has to be created with a *RequestConfig*, specifying the desired optional parameters.
2.   A request has to be created with a *SynthesizeRequest*, specifying the text to be synthesized and the previously created configuration.
3.   By calling the *Synthesize* method with the created request, the text is synthesized with the specified configuration.



The following example uses Linda's voice (the English pipeline), so that the synthesized text is pronounced by an English speaker.

In [None]:
config = text_to_speech_pb2.RequestConfig(t2s_pipeline_id=english_pipeline.id)
request = text_to_speech_pb2.SynthesizeRequest(text="Hi, this is Alexandra. Thanks for calling. I'm not here at the moment, so please leave a message and I'll call you back.", config=config)
response = stub.Synthesize(request=request)
bio = io.BytesIO(response.audio)
audio = sf.read(bio, )
ipd.Audio(audio[0], rate=audio[1])
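
When `soundfile` is not available, a WAV response can also be inspected with the standard library. The sketch below assumes the returned bytes are an uncompressed PCM WAV file; since no server is contacted here, `make_test_wav` (a hypothetical helper) generates a short tone that stands in for `response.audio`, and the sample rate of 22050 Hz is only an assumption for the stand-in.

```python
import io
import math
import struct
import wave


def make_test_wav(duration_s: float = 0.5, sample_rate: int = 22050) -> bytes:
    """Build a mono 16-bit WAV in memory; a stand-in for response.audio."""
    n_frames = int(duration_s * sample_rate)
    frames = b"".join(
        struct.pack("<h", int(10000 * math.sin(2 * math.pi * 440 * i / sample_rate)))
        for i in range(n_frames)
    )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(frames)
    return buffer.getvalue()


def describe_wav(audio_bytes: bytes) -> dict:
    """Read the WAV header and return sample rate, bit depth, channels and duration."""
    with wave.open(io.BytesIO(audio_bytes), "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        return {
            "sample_rate": sample_rate,
            "bits_per_sample": 8 * wav_file.getsampwidth(),
            "channels": wav_file.getnchannels(),
            "duration_s": wav_file.getnframes() / sample_rate,
        }


info = describe_wav(make_test_wav())
print(info)
```

In the notebook, `describe_wav(response.audio)` would be the corresponding call, provided the response was requested in wav format.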

### Length scale configuration
The attribute length_scale can be adjusted to speed up or slow down the audio: values below 1.0 make the speech faster, values above 1.0 make it slower.
In the next example the length_scale attribute is set to 0.5, so the returned audio will be twice as fast as the original.

In [None]:
config = text_to_speech_pb2.RequestConfig(t2s_pipeline_id=english_pipeline.id, length_scale = 0.5)
request = text_to_speech_pb2.SynthesizeRequest(text="Hi, this is Alexandra. Thanks for calling. I'm not here at the moment, so please leave a message and I'll call you back.", config=config)
response = stub.Synthesize(request=request)
bio = io.BytesIO(response.audio)
audio = sf.read(bio, )
ipd.Audio(audio[0], rate=audio[1])

Next, the same attribute is set to 2.0, so the returned audio will be half as fast as the original.

In [None]:
config = text_to_speech_pb2.RequestConfig(t2s_pipeline_id=english_pipeline.id, length_scale = 2.0)
request = text_to_speech_pb2.SynthesizeRequest(text="Hi, this is Alexandra. Thanks for calling. I'm not here at the moment, so please leave a message and I'll call you back.", config=config)
response = stub.Synthesize(request=request)
bio = io.BytesIO(response.audio)
audio = sf.read(bio, )
ipd.Audio(audio[0], rate=audio[1])
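
The effect of length_scale on the audio length can be estimated with simple arithmetic. This is a rough sketch under the assumption that the duration scales linearly with length_scale, as the two examples above suggest; the real duration depends on the model and may differ slightly. The function name `expected_duration` is hypothetical.

```python
def expected_duration(original_duration_s: float, length_scale: float) -> float:
    """Estimate the audio duration for a given length_scale (linear approximation)."""
    if length_scale <= 0:
        raise ValueError("length_scale must be positive")
    return original_duration_s * length_scale


# A 6-second utterance at the default length_scale of 1.0:
print(expected_duration(6.0, 0.5))  # twice as fast, about half as long
print(expected_duration(6.0, 2.0))  # half as fast, about twice as long
```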

### Audio Format configuration

Another attribute that can be configured is the audio_format.
The audio format can be set to:

- 0 for wav
- 1 for flac
- 2 for caf (Core audio format)
- 3 for mp3
- 4 for aac (Advanced audio coding)
- 5 for ogg
- 6 for wma (Windows media audio)
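
To avoid bare integers in client code, the codes listed above can be mirrored in a small local enum. This is only an illustrative sketch: `AudioFormatCode` and `file_extension` are hypothetical names, not part of the API, and the values are copied from the list above.

```python
from enum import IntEnum


class AudioFormatCode(IntEnum):
    """Local mirror of the audio_format codes listed above (hypothetical helper)."""

    WAV = 0
    FLAC = 1
    CAF = 2
    MP3 = 3
    AAC = 4
    OGG = 5
    WMA = 6


def file_extension(code: int) -> str:
    """Return a file extension such as '.wav' for an audio_format code."""
    return "." + AudioFormatCode(code).name.lower()


print(file_extension(AudioFormatCode.FLAC))
```

Because `IntEnum` members are integers, a value such as `AudioFormatCode.WAV` compares equal to the integer 0 used for audio_format in the cells of this notebook.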

In the following example, the attribute audio_format is set to create a wav audio file.

In [None]:
config = text_to_speech_pb2.RequestConfig(t2s_pipeline_id=english_pipeline.id, audio_format= 0)
request = text_to_speech_pb2.SynthesizeRequest(text="Hi, this is Alexandra. Thanks for calling. I'm not here at the moment, so please leave a message and I'll call you back.", config=config)
response = stub.Synthesize(request=request)
bio = io.BytesIO(response.audio)
audio = sf.read(bio, )
ipd.Audio(audio[0], rate=audio[1])

### Pulse Code Modulation

The pcm attribute selects the sample encoding (bit depth) used for the audio file.
A PCM (pulse code modulation) signal is a sequence of digital audio samples that carry the information needed to reconstruct the original analog signal.

- 0 for 16 (16  bits per sample)
- 1 for 24 (24  bits per sample)
- 2 for 32 (32  bits per sample)
- 3 for S8
- 4 for U8
- 5 for Float
- 6 for Double

The number of bits per sample affects the quality and size of the returned file. As the number of bits per sample increases, so do the size and quality of the file.
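
The relation between bit depth and file size can be made concrete with a short calculation. The sketch below estimates the size of the raw sample data only (no file header, no compression). It assumes that S8 and U8 use 8 bits, Float 32 bits and Double 64 bits per sample, and it uses a mono signal at 22050 Hz purely as an illustrative default; `BITS_PER_SAMPLE` and `raw_audio_size_bytes` are hypothetical names.

```python
# pcm code -> bits per sample (S8/U8, Float and Double widths are assumptions)
BITS_PER_SAMPLE = {0: 16, 1: 24, 2: 32, 3: 8, 4: 8, 5: 32, 6: 64}


def raw_audio_size_bytes(
    pcm_code: int,
    duration_s: float,
    sample_rate: int = 22050,  # assumed sample rate, for illustration only
    channels: int = 1,
) -> int:
    """Estimate the size of the uncompressed sample data in bytes."""
    bytes_per_sample = BITS_PER_SAMPLE[pcm_code] // 8
    n_frames = int(duration_s * sample_rate)
    return n_frames * channels * bytes_per_sample


for code in (4, 0, 2):
    print(code, raw_audio_size_bytes(code, duration_s=2.0))
```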

In the first example the audio file is generated with 16 bits per sample (pcm=0), and in the second with unsigned 8-bit samples (pcm=4), which gives a smaller file of lower quality.

In [None]:
config = text_to_speech_pb2.RequestConfig(t2s_pipeline_id=english_pipeline.id, pcm=0)
request = text_to_speech_pb2.SynthesizeRequest(text="Hi, this is Alexandra. Thanks for calling. I'm not here at the moment, so please leave a message and I'll call you back.", config=config)
response = stub.Synthesize(request=request)
bio = io.BytesIO(response.audio)
audio = sf.read(bio, )
ipd.Audio(audio[0], rate=audio[1])

In [None]:
config = text_to_speech_pb2.RequestConfig(t2s_pipeline_id=english_pipeline.id, pcm=4)
request = text_to_speech_pb2.SynthesizeRequest(text="Hi, this is Alexandra. Thanks for calling. I'm not here at the moment, so please leave a message and I'll call you back.", config=config)
response = stub.Synthesize(request=request)
bio = io.BytesIO(response.audio)
audio = sf.read(bio, )
ipd.Audio(audio[0], rate=audio[1])

## Make a synthesize request to the server for a batch of texts

The running server can synthesize a batch of texts into audio files. To do so, call the *BatchSynthesize* method, which receives a *BatchSynthesizeRequest* and returns a *BatchSynthesizeResponse*.

The example below shows how to synthesize a batch of texts into audio files.

1.   A configuration has to be created with a RequestConfig for each text in the batch, specifying the desired optional parameters.
2.   A SynthesizeRequest has to be created for each text in the batch, specifying the text to be synthesized and its configuration. The requests are then collected in a BatchSynthesizeRequest.
3.   By calling the BatchSynthesize method with the created request, each text is synthesized with its specified configuration.
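
The steps above can be wrapped in a small helper when many texts share the same structure. This is a sketch, not part of the client library: `build_batch_request` is a hypothetical function, and the protobuf module is passed in as the `pb2` argument (in this notebook that would be `text_to_speech_pb2`) so that the helper itself has no import-time dependencies. It only uses the message names and fields that appear in this tutorial.

```python
from typing import Any, Iterable, Tuple


def build_batch_request(pb2: Any, items: Iterable[Tuple[str, str]]) -> Any:
    """Build a BatchSynthesizeRequest from (text, pipeline_id) pairs.

    pb2 is expected to provide RequestConfig, SynthesizeRequest and
    BatchSynthesizeRequest, as text_to_speech_pb2 does in this notebook.
    """
    requests = [
        pb2.SynthesizeRequest(
            text=text,
            config=pb2.RequestConfig(t2s_pipeline_id=pipeline_id),
        )
        for text, pipeline_id in items
    ]
    return pb2.BatchSynthesizeRequest(batch_request=requests)
```

A call such as `build_batch_request(text_to_speech_pb2, [("Thank you.", "linda"), ("Danke.", "alexandra")])` would produce a request that can be passed to `stub.BatchSynthesize`.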

In [None]:
config_1 = text_to_speech_pb2.RequestConfig(t2s_pipeline_id=english_pipeline.id, length_scale = 1.0, pcm=0, audio_format= 0)
config_2 = text_to_speech_pb2.RequestConfig(t2s_pipeline_id=german_pipeline.id, length_scale = 1.0, pcm=0, audio_format= 1)
config_3 = text_to_speech_pb2.RequestConfig(t2s_pipeline_id=german_pipeline.id, length_scale = 1.0, pcm=1, audio_format= 0)

request_1 = text_to_speech_pb2.SynthesizeRequest(text="Thank you for your response. We will be waiting you for your appointment with Doctor Smith on Tuesday the first of October at ten in the afternoon.", config=config_1)
request_2 = text_to_speech_pb2.SynthesizeRequest(text="Danke für Ihre Antwort.", config=config_2)
request_3 = text_to_speech_pb2.SynthesizeRequest(text="Wir erwarten Sie am Dienstag, den 1. Oktober, um 10 Uhr nachmittags zu Ihrem Termin bei Dr. Smith.", config=config_3)

request = text_to_speech_pb2.BatchSynthesizeRequest(batch_request = [request_1, request_2, request_3])

response = stub.BatchSynthesize(request = request)

In [None]:
for message in response.batch_response:
    bio = io.BytesIO(message.audio)
    audio = sf.read(bio, )
    display(ipd.Audio(audio[0], rate=audio[1], autoplay=False))

## Get Pipeline

To get a specific pipeline configuration, use the GetT2sPipeline method. This method receives a T2sPipelineId and returns a Text2SpeechConfig.

In [None]:
request = text_to_speech_pb2.T2sPipelineId(id=german_pipeline.id)
pipeline_config = stub.GetT2sPipeline(request=request)

## Create Pipeline

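The server provides a method for creating new pipelines. This can be done with the function CreateT2sPipeline, which receives a Text2SpeechConfig and returns a T2sPipelineId. The example below reuses the configuration retrieved with GetT2sPipeline, sets its length_scale to 2 and registers it under the new id 'alexandra_2.0'.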

In [None]:
new_inference_config = pipeline_config.inference
new_inference_config.composite_inference.text2mel.glow_tts.length_scale = 2
request = text_to_speech_pb2.Text2SpeechConfig(id='alexandra_2.0', description=pipeline_config.description, active=True, inference=new_inference_config, normalization=pipeline_config.normalization, postprocessing=pipeline_config.postprocessing)
new_pipeline_id = stub.CreateT2sPipeline(request=request)

## Update Pipeline

The server provides a method to update a pipeline called UpdateT2sPipeline, receiving a pipeline configuration.

In the following example, the configuration retrieved earlier with GetT2sPipeline is modified and used to update the pipeline.

In [None]:
pipeline_config.inference.composite_inference.text2mel.glow_tts.length_scale = 2

In [None]:
stub.UpdateT2sPipeline(request=pipeline_config)

## Delete Pipeline

A pipeline can be deleted with the method DeleteT2sPipeline, which receives a pipeline id.

In [None]:
request = text_to_speech_pb2.T2sPipelineId(id='alexandra_2.0')
pipeline_config = stub.GetT2sPipeline(request=request)
stub.DeleteT2sPipeline(request=request)