It is important to establish some kind of quality assurance for the generated dubs. To do so, we will generate a large number of dubs and then test them against Rhasspy. We already use Rhasspy in our voice assistant, so it's a quick win.

First, let's define a large enough set of templates: phrases that every home assistant will eventually have to understand.

In [1]:
from demos.persona.intents import Intents

[c.name for c in Intents.get_templates()]

['demos.persona.intents.Intents.yes',
 'demos.persona.intents.Intents.no',
 'demos.persona.intents.Intents.weather',
 'demos.persona.intents.Intents.time',
 'demos.persona.intents.Intents.date',
 'demos.persona.intents.Intents.transport',
 'demos.persona.intents.Intents.timer_create',
 'demos.persona.intents.Intents.timer_how_much_time',
 'demos.persona.intents.Intents.timer_how_many_timers',
 'demos.persona.intents.Intents.timer_cancel',
 'demos.persona.intents.Intents.spotify',
 'demos.persona.intents.Intents.cook']

In [2]:
batch = 'intent_voicing'
batch

'intent_voicing'

We will create dubbing tasks for all these intents, just as we did before:

In [3]:
from kaia.persona.dub.languages.en import get_predefined_dubs, DubbingTaskCreator
from kaia.brainbox import BrainBox

ADDRESS = 'http://192.168.178.50'
box = BrainBox()
box_api = box.create_api(ADDRESS)

def create_tasks():
    voice = box.settings.tortoise_tts.test_voice
    tc = DubbingTaskCreator()
    sequences = tc.fragment(get_predefined_dubs(), Intents.get_templates(), voice)
    optimized_sequences = tc.optimize_sequences(sequences)
    dub_and_cut_tasks = tc.create_dub_and_cut_tasks(optimized_sequences)
    bb_tasks = tc.create_tasks(dub_and_cut_tasks, 'TortoiseTTS', 'aligned_dub', batch)
    return bb_tasks


def add_tasks(bb_tasks):
    for task in bb_tasks:
        box_api.add_task(task)

#add_tasks(create_tasks())

Then we download the results and encode them:

In [4]:
from kaia.infra import Loc
from ipywidgets import Audio, VBox
from kaia.persona.dub.languages.en import DubbingPack
from pathlib import Path


pack_path = Path('files/intent_dubbing.zip')
host_path = Loc.temp_folder/'demos/dubbing/intent_dubbing'


def download_pack(recode=False):
    target_task = [t for t in box_api.get_tasks(batch) if t['back_track'] == 'Dubbing'][-1]
    print(target_task['received_timestamp'])
    result = box_api.get_result(target_task['id'])
    if result is None:
        raise ValueError('Not yet ready')
    box_api.download(result, pack_path, True)

#download_pack(True)

Let's check that it works:

In [5]:
pack = DubbingPack.from_zip(host_path, pack_path)
template = Intents.timer_create
s = template.to_str(template.get_random_value())
print(s)
audios = []
for i in range(3):
    fname = pack.create_dubber(option_index=i).dub_string(s, template)
    audios.append(Audio.from_file(fname, autoplay = False))
VBox(audios)

Set the timer for one hour, three minutes and one second


VBox(children=(Audio(value=b'RIFF\x0c\x14\x03\x00WAVEfmt \x10\x00\x00\x00\x01\x00\x01\x00\xc0]\x00\x00\x80\xbb…

Now, let's perform testing:

In [6]:
from kaia.persona.dub.languages.en import TestingTools

test = TestingTools(Intents.get_templates(), 100)

In [7]:
from kaia.persona.dub.core import RhasspyAPI
import pandas as pd

def make_test():
    api = RhasspyAPI.create('http://127.0.0.1:12101', Intents.get_templates())
    api.train()
    dfs = []
    for i in range(3):
        df = TestingTools.samples_to_df(test.test_voice(pack.create_dubber(option_index=i), api))
        df['option_index'] = i
        dfs.append(df)

    df = pd.concat(dfs)
    df.to_parquet('files/test_on_intents.parquet')

#make_test()

In [8]:
df = pd.read_parquet('files/test_on_intents.parquet')
df.groupby(['option_index'])[['match','match_values','match_keys','match_intent']].mean()

option_index     match  match_values  match_keys  match_intent
0             0.787097      0.787097         1.0           1.0
1             0.729032      0.729032         1.0           1.0
2             0.819355      0.819355         1.0           1.0


The results are, again, decent: the intent is never missed, and the interpretation is completely correct in roughly 73-82% of cases, depending on the option.
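
The exact definitions of the four columns live in `TestingTools` and are not shown here. One plausible reading, assumed for this sketch only, is that they compare the expected and the recognized intent at three levels: the intent name, the set of slot keys, and the slot values. The `compare` function, the dictionary layout and the slot names below are hypothetical and are not part of the library.

```python
def compare(expected, recognized):
    """Compare an expected intent with a recognized one at three levels.

    Assumed layout (hypothetical): {"intent": str, "slots": {key: value}}.
    """
    match_intent = expected["intent"] == recognized["intent"]
    # Keys can only match if the intent itself was recognized correctly.
    match_keys = match_intent and set(expected["slots"]) == set(recognized["slots"])
    # Values can only match if the same keys are present.
    match_values = match_keys and expected["slots"] == recognized["slots"]
    return {
        "match_intent": match_intent,
        "match_keys": match_keys,
        "match_values": match_values,
        "match": match_intent and match_keys and match_values,
    }


# Hypothetical example: right intent and slots, but one value is misheard.
expected = {"intent": "timer_create", "slots": {"hours": "1", "minutes": "3"}}
recognized = {"intent": "timer_create", "slots": {"hours": "1", "minutes": "30"}}
result = compare(expected, recognized)
```

Under this reading, the table above would mean that all errors are of the kind shown in the example: the intent and the slot keys are right, and only a slot value is wrong.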

The experiment also traces the audio files that were used to compose each utterance. We can now see which fragments are particularly problematic:

In [9]:
from yo_fluq_ds import *

fdf = Query.df(df).select_many(lambda z: ((z.match, dec) for dec in z.decomposition)).to_dataframe(columns=['match','fragment'])
fdf = fdf.merge(pack.df.set_index('file_name'), left_on='fragment', right_index=True)
fails = fdf.groupby(['option_index', 'text', 'fragment']).match.mean().sort_values().to_frame().reset_index()

fails.loc[fails.match==0].head()

   option_index    text                                  fragment  match
0             0  minute  357181c5-2207-48da-9118-e95c564dba0f.ogg    0.0
1             2  minute  1babebef-e216-4e42-9e3e-30523d4d814e.ogg    0.0
2             1    zero  e9045da8-737b-46bd-9a9c-fe550485b8ed.ogg    0.0
3             1    hour  ca58adc6-1e31-4d07-ad52-4c696fcc12b6.ogg    0.0
4             1  minute  77eeb998-27c3-403d-b8cf-c0067800d4e5.ogg    0.0
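
The attribution step in the cell above can also be sketched in plain Python, without `yo_fluq_ds` or pandas. Every fragment of an utterance inherits the utterance's `match` flag, and the flags are then averaged per fragment. The sample records and file names below are made up for illustration; only the `match` and `decomposition` field names are taken from the dataframe.

```python
from collections import defaultdict


def fragment_match_rates(samples):
    """Return {fragment: mean match} over all utterances that used the fragment."""
    totals = defaultdict(lambda: [0, 0])  # fragment -> [matched count, usage count]
    for sample in samples:
        for fragment in sample["decomposition"]:
            totals[fragment][0] += int(sample["match"])
            totals[fragment][1] += 1
    return {f: matched / used for f, (matched, used) in totals.items()}


# Hypothetical test results: each utterance lists the audio files it was built from.
samples = [
    {"match": True,  "decomposition": ["set.ogg", "one.ogg", "hour.ogg"]},
    {"match": False, "decomposition": ["set.ogg", "two.ogg", "minute.ogg"]},
    {"match": False, "decomposition": ["set.ogg", "one.ogg", "minute.ogg"]},
]
rates = fragment_match_rates(samples)
# Fragments with the lowest match rate come first.
worst_first = sorted(rates.items(), key=lambda kv: kv[1])
```

A rate of zero does not by itself prove that a fragment is broken: with few samples, a good fragment may simply have appeared only next to a bad one. Listening to the audio, as done in the next cell, settles the question.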


In [10]:
from ipywidgets import VBox, HBox, Label

def view_fails(fails):
    return Query.df(fails).select(lambda z: HBox([Label(z.text+'/'+str(z.option_index)), Audio.from_file(host_path/z.fragment, autoplay=False)])).feed(list, VBox)

view_fails(fails.loc[fails.match==0])

VBox(children=(HBox(children=(Label(value='minute/0'), Audio(value=b'OggS\x00\x02\x00\x00\x00\x00\x00\x00\x00\…

Such an analysis reveals where TortoiseTTS fails. This way we identified and corrected the following:

* The upper bound on the length of a string to voice: around 60-70. Much longer strings are sometimes processed correctly, but not reliably.

* Sequences like "six, sixteenth, sixth, sixtieth, sixty, tenth" are not the best way to organize the voiceover, as TortoiseTTS fails to pronounce "tenth" in this case.

* How to cut a sequence like "three, four". It appears the correct cut is "three, " and "four": cutting into "three"/"four" loses the ending of "three", and cutting into "three,"/" four" adds noise to the beginning of "four". Unfortunately, TortoiseTTS has no pause tag that could improve this further.
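
The length bound and the cutting rule from the list above can be illustrated on plain strings. This is a minimal sketch, not the implementation used by `DubbingTaskCreator`: both function names are hypothetical, the bound is assumed to be measured in characters, and the real pipeline cuts audio rather than text.

```python
def cut_sequence(text, separator=", "):
    """Split text so that each separator stays at the end of the preceding fragment."""
    parts = text.split(separator)
    return [part + separator for part in parts[:-1]] + [parts[-1]]


def pack_into_sequences(words, max_len=60, separator=", "):
    """Greedily join words into sequences of at most max_len characters.

    A single word longer than max_len is kept as a sequence of its own.
    """
    sequences, current = [], ""
    for word in words:
        candidate = word if not current else current + separator + word
        if current and len(candidate) > max_len:
            # Adding this word would exceed the bound: close the current sequence.
            sequences.append(current)
            current = word
        else:
            current = candidate
    if current:
        sequences.append(current)
    return sequences


fragments = cut_sequence("three, four")
packed = pack_into_sequences(["one", "two", "three", "four", "five"], max_len=10)
```

Here `fragments` is `["three, ", "four"]`, which mirrors the cut described above, and joining the fragments restores the original string. The small `max_len` in the second call only keeps the example short.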

Moreover, you can use such an analysis to find out which fragments need to be regenerated.
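
As a rough sketch of that selection step, the rows of a table like `fails` can be grouped into a per-option list of texts to dub again. The function name and the threshold parameter are hypothetical. The first five rows repeat the output shown earlier, and the last row is made up to show a fragment that is kept.

```python
from collections import defaultdict


def texts_to_regenerate(fail_rows, threshold=0.0):
    """Group texts whose fragment match rate is at or below threshold by option_index."""
    todo = defaultdict(set)
    for row in fail_rows:
        if row["match"] <= threshold:
            todo[row["option_index"]].add(row["text"])
    return {option: sorted(texts) for option, texts in todo.items()}


fail_rows = [
    {"option_index": 0, "text": "minute", "match": 0.0},
    {"option_index": 2, "text": "minute", "match": 0.0},
    {"option_index": 1, "text": "zero", "match": 0.0},
    {"option_index": 1, "text": "hour", "match": 0.0},
    {"option_index": 1, "text": "minute", "match": 0.0},
    {"option_index": 0, "text": "one", "match": 0.9},  # hypothetical row, not regenerated
]
todo = texts_to_regenerate(fail_rows)
```

Raising `threshold` widens the selection to fragments that fail often but not always. How the selected texts are submitted for dubbing again depends on the task creation code and is not covered here.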

Note: to prove that these are really the faults of TortoiseTTS and not of Rhasspy, we did the following several times: take the sentences that were problematic for parsing, dub them as whole sentences via BrainBox, and feed them to Rhasspy again. This produced exactly 0 errors. So if something fails, it is most likely the fault of TortoiseTTS: combinatorial issues, such as wrong words or wrong cut boundaries, are normally caught by the unit tests, and Rhasspy is proven to work with adequately generated TortoiseTTS content.