Now everything is set to export data from (several) annotations into one MediaLibrary, and run the container on this MediaLibrary. 

In [1]:
from pathlib import Path
from kaia.ml.voice_cloning.data_prep.data_export import finalize_annotation
from kaia.ml.voice_cloning.coqui_training_container import CoquiTrainingContainerSettings
import os

# Symbol replacements: a few symbols worked fine with Tortoise, but CoquiTTS refused to consume them, so this is a good place for the final fix
replacements = {
    '–': '-',
    "’": "'",
    '…': '...',
    '"': '',
    '—': '-',
}

# If you're experimenting with the voices, you might have several versions of one voice in different annotated MediaLibraries.
# This mapping helps merge them together
voice_replacements = {
}

# You may have several media libraries, each coming with its own annotation file. This is how you merge everything
lib_to_annotation = {
    Path('files/voicelines.zip'): Path('files/annotation.txt')
}

settings = CoquiTrainingContainerSettings()
os.makedirs(settings.resource_folder, exist_ok=True)
dataset_location = settings.resource_folder/'dataset.zip'
finalize_annotation(lib_to_annotation, dataset_location, replacements, voice_replacements)

  0%|          | 0/9 [00:00<?, ?it/s]
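
To make the effect of the two mappings concrete, here is a minimal sketch of what such replacements do to a text and a voice name. It is an illustration only: how `finalize_annotation` applies them internally is not shown in this notebook, and the helper names `normalize_text` and `normalize_voice`, as well as the voice name `lina_v2`, are hypothetical.

```python
# Illustration only: finalize_annotation may apply the mappings differently.
replacements = {
    '–': '-',
    '’': "'",
    '…': '...',
    '"': '',
    '—': '-',
}

# Hypothetical mapping: two versions of one voice merged under a single name.
voice_replacements = {'lina_v2': 'lina'}


def normalize_text(text, replacements):
    """Apply every symbol replacement to the text, one after another."""
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text


def normalize_voice(voice, voice_replacements):
    """Map a voice name to its merged name; unknown names stay unchanged."""
    return voice_replacements.get(voice, voice)


print(normalize_text('Wait… it’s fine — "really"', replacements))
print(normalize_voice('lina_v2', voice_replacements))
```

Symbols that are not listed in `replacements` pass through untouched, so anything else CoquiTTS rejects has to be added to the dictionary explicitly.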

There are a couple of useful statistics. First, the total duration for each voice. In my experiments with YourTTS and VITS, 20 minutes was just about enough for YourTTS and definitely enough for VITS. Since generating and annotating samples is hard, I'd recommend starting with maybe 10 minutes if you only plan to train VITS.

In [5]:
from kaia.brainbox import MediaLibrary

lib = MediaLibrary.read(dataset_location)
odf = lib.to_df()
df = odf.loc[odf.selected]
df.head()

Unnamed: 0,voice,text,option_index,origin,selected,mark,duration,filename,timestamp,job_id
0,lina,"Trollocs were usually cowards in their way, pr...",0,voicelines.zip,True,good,5.749333,8de83d10-129a-4c01-a738-17479ea8fc73.wav,2024-05-27 17:05:32.214947,id_1438b7bc60e44374963cd30343f87dc5
3,lina,"The Deathwatch Guard has charge of my safety, ...",0,voicelines.zip,True,good,5.237333,2e0e6788-a923-4a26-a2b1-1fe03af3802b.wav,2024-05-27 17:05:32.214947,id_5ab1171984414b72a7e064664cf769b1
6,lina,"Jhogo gave a pull on the whip, yanking Viserys...",0,voicelines.zip,True,good,5.152,fecad6be-7fb9-43df-aea8-d4ba6c8e623d.wav,2024-05-27 17:05:32.214947,id_e78f2a86c3ad45aa87875ab93bae669b


In [6]:
df.groupby('voice').duration.sum()

voice
lina    16.138667
Name: duration, dtype: float64
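
The sum above appears to be in seconds (three clips of roughly 5 seconds each), so it is easier to compare against the 10 or 20 minute targets after converting. A minimal sketch, assuming `duration` is in seconds; the toy DataFrame stands in for the real `df`, and `TARGET_MINUTES` is a name introduced here for illustration:

```python
import pandas as pd

# Toy data standing in for the real `df`; durations are assumed to be seconds.
df = pd.DataFrame({
    'voice': ['lina', 'lina', 'lina'],
    'duration': [5.749333, 5.237333, 5.152],
})

TARGET_MINUTES = 10  # suggested starting point when training only VITS

# Total duration per voice, converted from seconds to minutes.
minutes = df.groupby('voice').duration.sum() / 60

# How much more material each voice needs to reach the target (never negative).
report = pd.DataFrame({
    'minutes': minutes,
    'missing_minutes': (TARGET_MINUTES - minutes).clip(lower=0),
})
print(report)
```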

And here is how you can check whether all the phonemes and letters are covered, and how well. Obviously, since the demo only contains voicelines for 3 sentences, a lot is not covered. In my experiments, the minimal coverage was 20. Again, lower values might also work.

Another thought: __maybe__ we should treat a phoneme/letter in the word-final position as a separate feature and stratify/measure these variants separately. Silencing of the trailing letters is a huge problem in T

In [3]:
from kaia.ml.voice_cloning.data_prep.data_export import dataset_features_statistics

sdf = dataset_features_statistics('files/golden_set.json', df)

In [4]:
sdf.sort_values('cnt').head(10)

Unnamed: 0,feature,voice,cnt
84,ᵻ,lina,0.0
37,iə,lina,0.0
28,aɪɚ,lina,0.0
27,aɪə,lina,0.0
25,_z,lina,0.0
23,_x,lina,0.0
44,n̩,lina,0.0
46,oː,lina,0.0
47,oːɹ,lina,0.0
63,ɔɪ,lina,0.0
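
With a full dataset, the same table can be filtered against the coverage threshold to list what still needs more samples. A minimal sketch, assuming `sdf` has the `feature`, `voice` and `cnt` columns shown above; the toy counts and the name `MIN_COVERAGE` are invented for illustration:

```python
import pandas as pd

# Toy stand-in for `sdf`; the columns mirror the output above, counts are made up.
sdf = pd.DataFrame({
    'feature': ['ᵻ', 'iə', 'a', 't'],
    'voice': ['lina'] * 4,
    'cnt': [0.0, 0.0, 25.0, 31.0],
})

MIN_COVERAGE = 20  # the minimal coverage from my experiments; adjust as needed

# Keep only the features that fall below the threshold.
undercovered = sdf.loc[sdf.cnt < MIN_COVERAGE]

# Collect the under-covered features per voice.
missing_per_voice = undercovered.groupby('voice').feature.apply(list)
print(missing_per_voice)
```

The resulting lists can guide which sentences to generate and annotate next.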


With the resulting MediaLibrary, you may start the training. Alternatively, you can redo the upsampling: in this case, the media library can be passed to the `generate_tasks` method to avoid repeating the upsampling for texts/voices for which an acceptable voiceline has already been found.