## **Text2Speech T2S API**

This tutorial uses ondewo-t2s-api to:

*   List the pipelines that can be used for synthesis
*   List the languages that can be used for synthesis
*   List the available domains
*   Synthesize a single text into audio
*   Synthesize a batch of texts into audio
*   Manage pipelines (create, get, update, delete)



In [1]:
import os
import io
import soundfile as sf
import IPython.display as ipd
import grpc
from ondewo_grpc.ondewo.t2s import text_to_speech_pb2, text_to_speech_pb2_grpc
import google.protobuf.empty_pb2 as empty_pb2
from google.protobuf.json_format import ParseDict, MessageToDict, MessageToJson

The example below creates a gRPC channel and a Text2Speech stub object. The channel is insecure by default. To use a secure channel instead, uncomment the grpc.secure_channel line and set grpc_cert to the server certificate.

In [3]:
from typing import Optional

MAX_MESSAGE_LENGTH: int = 60000000
GRPC_HOST: str = "localhost"  # "<ADD GRPC SERVER HERE>"
GRPC_PORT: str = "50555"  # "<ADD GRPC PORT HERE>"
CHANNEL: str = f"{GRPC_HOST}:{GRPC_PORT}"
# PEM-encoded certificate as bytes; None falls back to the default root certificates
grpc_cert: Optional[bytes] = None  # "<ADD CERTIFICATE HERE>"
credentials = grpc.ssl_channel_credentials(root_certificates=grpc_cert)

# Raise the default gRPC message size limits so that large audio responses fit
options = [
    ('grpc.max_send_message_length', MAX_MESSAGE_LENGTH),
    ('grpc.max_receive_message_length', MAX_MESSAGE_LENGTH),
]

# Secure variant (requires a valid certificate):
# channel = grpc.secure_channel(CHANNEL, credentials=credentials, options=options)
channel = grpc.insecure_channel(CHANNEL, options=options)

stub = text_to_speech_pb2_grpc.Text2SpeechStub(channel=channel)
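
The channel setup above can be wrapped in a small helper that switches between a secure and an insecure channel. This is only a sketch: `channel_options`, `create_channel`, and the `use_secure_channel` flag are names assumed here for illustration and are not part of ondewo-t2s-api. `grpc_cert` is expected to hold the PEM-encoded certificate as bytes.

```python
from typing import List, Optional, Tuple

MAX_MESSAGE_LENGTH: int = 60000000


def channel_options(max_message_length: int = MAX_MESSAGE_LENGTH) -> List[Tuple[str, int]]:
    """Return gRPC channel options that raise the send/receive size limits."""
    return [
        ("grpc.max_send_message_length", max_message_length),
        ("grpc.max_receive_message_length", max_message_length),
    ]


def create_channel(host: str, port: str, use_secure_channel: bool = False,
                   grpc_cert: Optional[bytes] = None):
    """Create a secure or an insecure channel (hypothetical helper)."""
    import grpc  # imported lazily so channel_options() works without grpc installed

    target = f"{host}:{port}"
    if use_secure_channel:
        if grpc_cert is None:
            raise ValueError("a certificate is required for a secure channel")
        credentials = grpc.ssl_channel_credentials(root_certificates=grpc_cert)
        return grpc.secure_channel(target, credentials, options=channel_options())
    return grpc.insecure_channel(target, options=channel_options())
```

With this helper, `create_channel("localhost", "50555")` would correspond to the insecure channel used in this tutorial.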


## List all existing text to speech pipelines

All relevant configuration of the text to speech server is defined in a text to speech pipeline. A running server can store several such pipelines at the same time, and the client chooses which one to use when sending a request to synthesize a text or a batch of texts.

The example below shows how to list all available pipelines by calling the ListT2sPipelines function, which takes a ListT2sPipelinesRequest as an argument and returns a ListT2sPipelinesResponse.

In [4]:
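# An empty ListT2sPipelinesRequest applies no filter and lists every pipeline
pipelines = stub.ListT2sPipelines(request=text_to_speech_pb2.ListT2sPipelinesRequest()).pipelines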
for pipeline in pipelines:
    print(pipeline.id)

babette
thomas
eric
matteo
roxana
danielle
glow_tts&hifi_gan-e976dd6c-2f41-484b-aec2-3e6868d37290
clara_it
brigitte
jenny
kerstin
elviras
samuel
roberto


## List all possible synthesis languages

A running server can list all languages that are available for synthesis and match the filters given in the request.

The example below shows how to list the available languages by calling the ListT2sLanguages function, which takes a ListT2sLanguagesRequest as an argument and returns a ListT2sLanguagesResponse.


In [5]:
request = text_to_speech_pb2.ListT2sLanguagesRequest(speaker_sexes=['male'])
response = stub.ListT2sLanguages(request=request)

## List all possible domains

A running server can list all domains that are available for synthesis and match the filters given in the request.

The example below shows how to list the available domains by calling the ListT2sDomains function, which takes a ListT2sDomainsRequest as an argument and returns a ListT2sDomainsResponse.


In [6]:
request = text_to_speech_pb2.ListT2sDomainsRequest(languages=['de'])
response = stub.ListT2sDomains(request=request)

## Make a synthesis request to the server

The running server can synthesize a text into audio. This is done with the Synthesize method, which takes a SynthesizeRequest and returns a SynthesizeResponse.

The following pipeline is used as an example:

In [7]:
german_pipeline = text_to_speech_pb2.T2sPipelineId(id='brigitte')
german_pipeline

id: "brigitte"

The example below shows how to synthesize a text into audio.
1.   A configuration is created with a RequestConfig, specifying the desired optional parameters.
2.   A request is created with a SynthesizeRequest, specifying the text to be synthesized and the previously created configuration.
3.   The Synthesize method is called with the created request, and the text is synthesized with the specified configuration.



In [8]:
def print_single_info(single_response):
    """Print the metadata of one synthesize response and show an audio player."""
    print("Info:")
    print(f"audio_uuid: {single_response.audio_uuid}")
    print(f"generation_time: {single_response.generation_time}")
    print(f"audio_length: {single_response.audio_length}")
    print(f"text: {single_response.text}")
    print(f"config: {single_response.config}")
    # Decode the returned audio bytes and display a player in the notebook
    data, sample_rate = sf.read(io.BytesIO(single_response.audio))
    ipd.display(ipd.Audio(data, rate=sample_rate))

def print_batch_info(response):
    """Print the metadata of every response in a batch."""
    for idx, single_response in enumerate(response.batch_response):
        print(f"AUDIO {idx}")
        print_single_info(single_response)

def find_pipeline_for_language(pipelines, language):
    """Return the first pipeline whose description matches the language, or None."""
    for pipeline in pipelines:
        if pipeline.description.language == language:
            return pipeline
    return None
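
The helper above decodes the response audio with soundfile. As a dependency-free alternative, the sketch below inspects and saves the audio bytes with Python's standard `wave` module. It assumes the response holds uncompressed PCM WAV data, which is the only WAV flavour `wave` reads. `wav_info`, `save_audio`, and `fake_audio` are illustrative names, and `fake_audio` is generated locally as a stand-in for `response.audio`.

```python
import io
import wave


def wav_info(audio_bytes: bytes) -> dict:
    """Read basic properties from in-memory WAV data."""
    with wave.open(io.BytesIO(audio_bytes), "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav_file.getnchannels(),
            "duration_s": frames / rate,
        }


def save_audio(audio_bytes: bytes, path: str) -> None:
    """Write the raw bytes of a response to disk unchanged."""
    with open(path, "wb") as out_file:
        out_file.write(audio_bytes)


# Stand-in for response.audio: one second of 16-bit mono silence at 22050 Hz.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(22050)
    wav_file.writeframes(b"\x00\x00" * 22050)
fake_audio = buffer.getvalue()

print(wav_info(fake_audio))
```

With a real response, `wav_info(response.audio)` and `save_audio(response.audio, "output.wav")` would be the corresponding calls.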


"Hallo Ich bin der liebe Thomas" 0.273775339126586
"Hallo Ich bin der liebe Thomas" 0.1190333366394043

"Hallo Ich bin der liebe Thomas der gerne lange Geschichten \
erzählt über die Germanen die damals vor langer Zeit Krieg \
gegen die Römer geführt haben, das war sehr grausam." 0.5342330932617188
"Hallo Ich bin der liebe Thomas der gerne lange Geschichten \
erzählt über die Germanen die damals vor langer Zeit Krieg \
gegen die Römer geführt haben, das war sehr grausam." 0.3341314792633056

"Hallo Ich bin der liebe alte böse\
Thomas der gerne lange Geschichten erzählt über die Germanen die damals vor langer Zeit\
Krieg gegen die Römer geführt haben, das war sehr grausam." 1.2942924499511719
"Hallo Ich bin der liebe alte böse\
Thomas der gerne lange Geschichten erzählt über die Germanen die damals vor langer Zeit\
Krieg gegen die Römer geführt haben, das war sehr grausam." 0.43032121658325195

"Hallo Ich bin der liebe Thomas" 0.3303382396697998
"Hallo Ich bin der liebe Thomas" 0.1962285041809082

"Hallo Ich bin der liebe Thomas der gerne lange Geschichten \
erzählt über die Germanen die damals vor langer Zeit Krieg \
gegen die Römer geführt haben, das war sehr grausam." 0.7013823986053467
"Hallo Ich bin der liebe Thomas der gerne lange Geschichten \
erzählt über die Germanen die damals vor langer Zeit Krieg \
gegen die Römer geführt haben, das war sehr grausam." 0.5884361267089844

"Hallo Ich bin der liebe alte böse\
Thomas der gerne lange Geschichten erzählt über die Germanen die damals vor langer Zeit\
Krieg gegen die Römer geführt haben, das war sehr grausam." 0.7190685272216797
"Hallo Ich bin der liebe alte böse\
Thomas der gerne lange Geschichten erzählt über die Germanen die damals vor langer Zeit\
Krieg gegen die Römer geführt haben, das war sehr grausam."  0.5857870578765869

In [18]:
import time
config = text_to_speech_pb2.RequestConfig(t2s_pipeline_id='brigitte')
#request = text_to_speech_pb2.SynthesizeRequest(text='{K A}', config=config) # 0
#request = text_to_speech_pb2.SynthesizeRequest(text='Oi, como vai? Eu sou Babette. tenho vinte e oito anos', config=config) # 0
#request = text_to_speech_pb2.SynthesizeRequest(text='Wennst kein Bayer bist, dann kriegst auch kein Weissbier!', config=config) # 0
#request = text_to_speech_pb2.SynthesizeRequest(text='I bin doch ned auf da Brennsuppn daher gschwomma', config=config) # 0
request = text_to_speech_pb2.SynthesizeRequest(text='Hello Im Brigitte 6 4 9 200', config=config) # 0

t0 = time.time()
response = stub.Synthesize(request=request)
t1 = time.time()
print(f"Responded in {t1-t0}s")
print_single_info(response)
# Decode the returned audio bytes and play them in the notebook
data, sample_rate = sf.read(io.BytesIO(response.audio))
ipd.Audio(data, rate=sample_rate, autoplay=True)

Responded in 0.411285400390625s
Info:
audio_uuid: 9ac2f14c-ecb3-49c9-82b3-51dd45897269
generation_time: 0.40965506434440613
audio_length: 4.260861873626709
text: Hello Im Brigitte 6 4 9 200
config: t2s_pipeline_id: "brigitte"
length_scale: 1.0
noise_scale: 0.6669999957084656
sample_rate: 22050
pcm: PCM_16
audio_format: wav
use_cache: false
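
The two timing fields in the output above can be combined into a single speed figure. The sketch below divides generation_time by audio_length, assuming both values are in seconds (the output does not state the unit). A ratio below 1 means the audio was generated faster than it takes to play.

```python
generation_time = 0.40965506434440613  # from the output above, assumed seconds
audio_length = 4.260861873626709       # from the output above, assumed seconds

# Ratio of synthesis time to playback time
real_time_factor = generation_time / audio_length
print(f"real-time factor: {real_time_factor:.3f}")
```

For this response the ratio is roughly 0.1, so synthesis took about a tenth of the playback duration.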



In [9]:
import time

config = text_to_speech_pb2.RequestConfig(t2s_pipeline_id='babette')
request = text_to_speech_pb2.SynthesizeRequest(text='Meine Nummer ist 82LAB12', config=config)
#request = text_to_speech_pb2.SynthesizeRequest(text='aaah', config=config)
# request = text_to_speech_pb2.SynthesizeRequest(text='beehh', config=config)
# request = text_to_speech_pb2.SynthesizeRequest(text='zeeh', config=config)
# request = text_to_speech_pb2.SynthesizeRequest(text='deeehh', config=config)
# request = text_to_speech_pb2.SynthesizeRequest(text='eeeeeehh', config=config)
# request = text_to_speech_pb2.SynthesizeRequest(text='eeff', config=config)
# request = text_to_speech_pb2.SynthesizeRequest(text='geeeh', config=config)
# request = text_to_speech_pb2.SynthesizeRequest(text='haaa', config=config)
# request = text_to_speech_pb2.SynthesizeRequest(text='iiih', config=config)
# request = text_to_speech_pb2.SynthesizeRequest(text='jeeeh', config=config)
# request = text_to_speech_pb2.SynthesizeRequest(text='kaaah', config=config)
# request = text_to_speech_pb2.SynthesizeRequest(text='eelll', config=config)
# request = text_to_speech_pb2.SynthesizeRequest(text='eehSH A R F EH S} {EH S Smmm', config=config)
# request = text_to_speech_pb2.SynthesizeRequest(text='eennnn', config=config)
# request = text_to_speech_pb2.SynthesizeRequest(text='ooohh', config=config)
# request = text_to_speech_pb2.SynthesizeRequest(text='peehh', config=config)
# request = text_to_speech_pb2.SynthesizeRequest(text='queeeh', config=config)
# request = text_to_speech_pb2.SynthesizeRequest(text='ehrrh', config=config)
# request = text_to_speech_pb2.SynthesizeRequest(text='eessss', config=config)
# request = text_to_speech_pb2.SynthesizeRequest(text='teeeh', config=config)
# request = text_to_speech_pb2.SynthesizeRequest(text='uuuu', config=config)
# request = text_to_speech_pb2.SynthesizeRequest(text='vauh', config=config)
# request = text_to_speech_pb2.SynthesizeRequest(text='weeeh', config=config)
# request = text_to_speech_pb2.SynthesizeRequest(text='ixxx', config=config)
# request = text_to_speech_pb2.SynthesizeRequest(text='ypsilon', config=config)
# request = text_to_speech_pb2.SynthesizeRequest(text='tsett', config=config)

# request = text_to_speech_pb2.SynthesizeRequest(text='umlaut eeehh', config=config) # ä
# request = text_to_speech_pb2.SynthesizeRequest(text='umlaut öh', config=config) # ö
# request = text_to_speech_pb2.SynthesizeRequest(text='umlaut üh', config=config) # ü
# request = text_to_speech_pb2.SynthesizeRequest(text='scharfes eesss', config=config) # ß
# request = text_to_speech_pb2.SynthesizeRequest(text='Bindestrich', config=config) # -
# request = text_to_speech_pb2.SynthesizeRequest(text='Komma', config=config) # ,
# request = text_to_speech_pb2.SynthesizeRequest(text='Punkt', config=config) # .
# request = text_to_speech_pb2.SynthesizeRequest(text='Schrägstrich', config=config) # /
# request = text_to_speech_pb2.SynthesizeRequest(text='Ausrufezeichen', config=config) # !
# request = text_to_speech_pb2.SynthesizeRequest(text='Fragezeichen', config=config) # ?
# request = text_to_speech_pb2.SynthesizeRequest(text='Gleichheitszeichen', config=config) # =
# request = text_to_speech_pb2.SynthesizeRequest(text='Klammer auf', config=config) # (
# request = text_to_speech_pb2.SynthesizeRequest(text='Klammer zu', config=config) # )
# request = text_to_speech_pb2.SynthesizeRequest(text='Logisches und', config=config) # &
# request = text_to_speech_pb2.SynthesizeRequest(text='Prozent', config=config) # %
# request = text_to_speech_pb2.SynthesizeRequest(text='Paragraff', config=config) # §

# request = text_to_speech_pb2.SynthesizeRequest(text='eins', config=config) # 1
#request = text_to_speech_pb2.SynthesizeRequest(text='zwei', config=config) # 2
#request = text_to_speech_pb2.SynthesizeRequest(text='drei', config=config) # 3
#request = text_to_speech_pb2.SynthesizeRequest(text='vier', config=config) # 4
#request = text_to_speech_pb2.SynthesizeRequest(text='fünff', config=config) # 5
#request = text_to_speech_pb2.SynthesizeRequest(text='sechss', config=config) # 6
#request = text_to_speech_pb2.SynthesizeRequest(text='sieben', config=config) # 7
#request = text_to_speech_pb2.SynthesizeRequest(text='achttt', config=config) # 8
#request = text_to_speech_pb2.SynthesizeRequest(text='nouin', config=config) # 9



t0 = time.time()
response = stub.Synthesize(request=request)
t1 = time.time()
print(f"Responded in {t1-t0}s")
print_single_info(response)
# Decode the returned audio bytes and play them in the notebook
data, sample_rate = sf.read(io.BytesIO(response.audio))
ipd.Audio(data, rate=sample_rate, autoplay=True)

_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNKNOWN
	details = "Exception calling application: Model set with model id babette is not registered in ModelManager. Available ids for model sets are ['alexandra', 'alexandra']"
	debug_error_string = "{"created":"@1648195392.593089104","description":"Error received from peer ipv4:192.168.1.202:50123","file":"src/core/lib/surface/call.cc","file_line":1064,"grpc_message":"Exception calling application: Model set with model id babette is not registered in ModelManager. Available ids for model sets are ['alexandra', 'alexandra']","grpc_status":2}"
>
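
The error above reports that the pipeline id babette is not registered on the server that was queried. One way to fail earlier and with a clearer message is to compare the desired id against the ids returned by ListT2sPipelines before calling Synthesize. The sketch below shows that check; `require_pipeline_id` is a hypothetical helper, not part of ondewo-t2s-api.

```python
from typing import Iterable


def require_pipeline_id(pipeline_id: str, available_ids: Iterable[str]) -> str:
    """Return pipeline_id if the server lists it, otherwise raise ValueError."""
    known = sorted(set(available_ids))
    if pipeline_id not in known:
        raise ValueError(
            f"pipeline '{pipeline_id}' is not available; known ids: {known}"
        )
    return pipeline_id


# Ids as they appear in the error message above
print(require_pipeline_id("alexandra", ["alexandra", "alexandra"]))
```

In the notebook, the list of ids could come from `[pipeline.id for pipeline in pipelines]`, using the `pipelines` variable from the listing cell.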

## Make a synthesis request to the server for a batch of texts

The running server can synthesize a batch of texts into audio. This is done with the BatchSynthesize method, which takes a BatchSynthesizeRequest and returns a BatchSynthesizeResponse.

The example below shows how to synthesize a batch of texts into audio.

1.   A configuration is created with a RequestConfig for each text in the batch, specifying the desired optional parameters.
2.   A SynthesizeRequest is created for each text, specifying the text to be synthesized and its configuration. The requests are then combined in a BatchSynthesizeRequest.
3.   The BatchSynthesize method is called with the created request, and each text is synthesized with its specified configuration.

In [None]:
german_id = german_pipeline.id  # 'brigitte', defined above

# pcm and audio_format are enum fields, given here by their numeric values
config_1 = text_to_speech_pb2.RequestConfig(t2s_pipeline_id=german_id, length_scale=1.0, pcm=0, audio_format=0)
config_2 = text_to_speech_pb2.RequestConfig(t2s_pipeline_id=german_id, length_scale=0.5, pcm=0, audio_format=1)
config_3 = text_to_speech_pb2.RequestConfig(t2s_pipeline_id=german_id, length_scale=1.0, pcm=1, audio_format=0)

request_1 = text_to_speech_pb2.SynthesizeRequest(text='Hello', config=config_1)
request_2 = text_to_speech_pb2.SynthesizeRequest(text='How are you?', config=config_2)
request_3 = text_to_speech_pb2.SynthesizeRequest(text='Hallo, wie geht es dir?', config=config_3)

request = text_to_speech_pb2.BatchSynthesizeRequest(batch_request=[request_1, request_2, request_3])

response = stub.BatchSynthesize(request=request)
print_batch_info(response)
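
When many texts share one configuration, the per-text boilerplate can be folded into a helper. The sketch below builds a BatchSynthesizeRequest from a list of strings. `build_batch_request` is a hypothetical helper, and the generated module is passed in as `pb2` so that the example also runs with the stand-in `fake_pb2`, which only mimics the three message constructors used in this tutorial.

```python
from types import SimpleNamespace


def build_batch_request(pb2, texts, pipeline_id, **config_kwargs):
    """Build one BatchSynthesizeRequest from plain strings (hypothetical helper)."""
    requests = [
        pb2.SynthesizeRequest(
            text=text,
            config=pb2.RequestConfig(t2s_pipeline_id=pipeline_id, **config_kwargs),
        )
        for text in texts
    ]
    return pb2.BatchSynthesizeRequest(batch_request=requests)


# Stand-in for text_to_speech_pb2 so the sketch runs without the generated module.
fake_pb2 = SimpleNamespace(
    RequestConfig=lambda **kwargs: SimpleNamespace(**kwargs),
    SynthesizeRequest=lambda **kwargs: SimpleNamespace(**kwargs),
    BatchSynthesizeRequest=lambda **kwargs: SimpleNamespace(**kwargs),
)

batch = build_batch_request(fake_pb2, ["Hello", "How are you?"], "brigitte", length_scale=1.0)
print(len(batch.batch_request))
```

Against a real server, the call would presumably look like `stub.BatchSynthesize(request=build_batch_request(text_to_speech_pb2, texts, 'brigitte'))`.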

## Create Pipeline

The server provides a method for creating new pipelines: CreateT2sPipeline takes a Text2SpeechConfig and returns a T2sPipelineId.

In [None]:
# description, inference, normalization and postprocessing are nested messages, so an existing configuration is used as a template
request = stub.GetT2sPipeline(request=german_pipeline)
request.id = 'brigitte_copy'  # hypothetical id for the new pipeline
new_pipeline_id = stub.CreateT2sPipeline(request=request)

## Get Pipeline

To get a specific pipeline configuration, the GetT2sPipeline method is used. This method takes a T2sPipelineId and returns a Text2SpeechConfig.

In [None]:
request = text_to_speech_pb2.T2sPipelineId(id='new_thomas_trimmed')
pipeline_config = stub.GetT2sPipeline(request=request)
pipeline_config

## Update Pipeline

The UpdateT2sPipeline method updates an existing pipeline. It takes a pipeline configuration (Text2SpeechConfig).

In the following example, the configuration retrieved in the previous call is modified and used to update the pipeline.

In [None]:
pipeline_config.inference.composite_inference.text2mel.glow_tts.length_scale = 2
stub.UpdateT2sPipeline(request=pipeline_config)
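
The get, modify, update sequence can be bundled into one function. The sketch below is a hypothetical helper, `set_glow_tts_length_scale`, that follows the same field path as the example above. `FakeStub` is a local stand-in that only mimics GetT2sPipeline and UpdateT2sPipeline, so the sketch runs without a server.

```python
from types import SimpleNamespace


def set_glow_tts_length_scale(stub, pipeline_id_message, length_scale):
    """Fetch a pipeline config, change its length_scale, and send the update."""
    config = stub.GetT2sPipeline(request=pipeline_id_message)
    config.inference.composite_inference.text2mel.glow_tts.length_scale = length_scale
    stub.UpdateT2sPipeline(request=config)
    return config


class FakeStub:
    """Stand-in for the Text2Speech stub, for illustration only."""

    def __init__(self):
        glow_tts = SimpleNamespace(length_scale=1.0)
        text2mel = SimpleNamespace(glow_tts=glow_tts)
        composite = SimpleNamespace(text2mel=text2mel)
        self.stored = SimpleNamespace(inference=SimpleNamespace(composite_inference=composite))
        self.update_calls = 0

    def GetT2sPipeline(self, request):
        return self.stored

    def UpdateT2sPipeline(self, request):
        self.update_calls += 1
        self.stored = request


fake_stub = FakeStub()
updated = set_glow_tts_length_scale(fake_stub, SimpleNamespace(id="new_thomas_trimmed"), 2.0)
print(updated.inference.composite_inference.text2mel.glow_tts.length_scale)
```

With the real stub, the second argument would be a `text_to_speech_pb2.T2sPipelineId` as in the Get Pipeline example.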

## Delete Pipeline

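A pipeline can be deleted with the DeleteT2sPipeline method, which takes a pipeline id (T2sPipelineId).

In [None]:
# Build the id explicitly instead of relying on a request variable left over from an earlier cell
stub.DeleteT2sPipeline(request=text_to_speech_pb2.T2sPipelineId(id='new_thomas_trimmed'))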