# Training

This is the final demonstration notebook. It post-processes the upsampled records and builds the training dataset. This dataset is then fed to `PiperTraining`, and the output of `PiperTraining` is evaluated.

First, I will upload the demo files to BrainBox to guarantee reproducibility of the notebook.

In [1]:
from brainbox import BrainBox
from yo_fluq import *

api = BrainBox.Api('127.0.0.1:8090')
for file in Query.folder('../files/upsampling_example/wavs/'):
    api.upload(file.name, file)

Next, you need to set up the Exporter. This entity transforms the `UpsamplingItem` into a much smaller `ExportItem`. In theory, for new voice cloners and new recognition models, we may want to design other UpsamplingItems with a different set of fields, and the Exporter will help to standardize them for annotation and dataset building.

The conversion takes two steps: first, the `ExportItem` is created. Then, when needed, the wav content is produced, as we don't want to keep it in memory all the time.

In [2]:
from chara.voice_clone.training import TrimExporter, StringDistance

exporter = TrimExporter(
    api = api,
    distance = StringDistance(),
    trim_text_start = 'The beginning. ',
    trim_text_end = ' The end.',
    max_vosk_trim=6,
    time_margin_in_seconds=0.2,
)

records = [exporter.export(rec) for rec in FileIO.read_pickle('../files/upsampling_example/records.pkl')]

This configuration cuts `trim_text_start` from the start of the text and `trim_text_end` from the end. It also cuts at most `max_vosk_trim` from the start and the end of the Vosk transcription so that the result best matches the trimmed text in terms of `StringDistance`. This step selects the best of all possible trimmings of the Vosk output, because Vosk may add or drop words in its transcription.
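
The selection of the best trimming can be sketched as a small brute-force search. This is only an illustration of the idea, not the actual `TrimExporter` code: the helper names are hypothetical, `difflib.SequenceMatcher` stands in for `StringDistance`, and the sketch assumes that `max_vosk_trim` counts words.

```python
from difflib import SequenceMatcher


def string_distance(a: str, b: str) -> float:
    """Stand-in distance: 0.0 for identical strings, approaching 1.0 for unrelated ones."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def best_vosk_trim(target_text: str, vosk_words: list[str], max_trim: int) -> tuple[int, int]:
    """Return (words cut from start, words cut from end) minimizing the distance to target_text."""
    best = (0, 0)
    best_distance = float('inf')
    for cut_start in range(max_trim + 1):
        for cut_end in range(max_trim + 1):
            if cut_start + cut_end >= len(vosk_words):
                continue  # this trimming would leave no words at all
            kept = vosk_words[cut_start:len(vosk_words) - cut_end]
            distance = string_distance(target_text.lower(), ' '.join(kept).lower())
            if distance < best_distance:
                best, best_distance = (cut_start, cut_end), distance
    return best


# Hypothetical Vosk output for "The beginning. Hello there, my friend. The end."
words = 'the beginning hello there my friend the end'.split()
print(best_vosk_trim('Hello there, my friend.', words, max_trim=6))
```

Trying every combination is cheap here, because at most `(max_trim + 1) ** 2` candidates are compared per record.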

When exporting, the sound is cut `time_margin_in_seconds` before the first and after the last relevant word recognized by Vosk.

There is also a `max_interword_margin` parameter that removes overly long pauses inside the sound fragment, if any are present.

Let's now look at the resulting records:
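
To make the two margins concrete, here is a minimal sketch of how the kept audio segments could be computed from word timings. It is one possible interpretation, not the exporter's implementation: the function name is hypothetical, the word timings are assumed to be `(start, end)` pairs in seconds, and a too-long pause is assumed to be shortened to exactly `max_interword_margin`.

```python
def cut_segments(word_times, time_margin, max_interword_margin=None):
    """Return the (start, end) audio segments to keep, in seconds.

    word_times: ordered (start, end) pairs of the relevant words.
    """
    segments = []
    segment_start = max(0.0, word_times[0][0] - time_margin)
    for (_, prev_end), (next_start, _) in zip(word_times, word_times[1:]):
        pause = next_start - prev_end
        if max_interword_margin is not None and pause > max_interword_margin:
            # Keep half of the allowed pause on each side and drop the middle.
            half = max_interword_margin / 2
            segments.append((segment_start, prev_end + half))
            segment_start = next_start - half
    # The caller is expected to clamp the last bound to the audio length.
    segments.append((segment_start, word_times[-1][1] + time_margin))
    return segments


timings = [(1.0, 1.4), (1.5, 2.0), (4.0, 4.5)]  # made-up timings with a 2-second pause
print(cut_segments(timings, time_margin=0.2, max_interword_margin=1.0))
```

Concatenating the returned segments yields a fragment in which no pause between words exceeds the allowed margin.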

In [3]:
import pandas as pd

df = pd.DataFrame([r.__dict__ for r in records])
df.groupby('character').duration.sum()

character
Axe     11.97
Lina    10.86
Name: duration, dtype: float64

Now we can annotate these records:

In [4]:
from chara.voice_clone.training import Annotator
from pathlib import Path
import os

path = Path('temp/annotation.txt')
if path.is_file():
    os.unlink(path)

annotator = Annotator('temp/annotation.txt', Annotator.Data(records), exporter, (Annotator.Feedback.yes, "No"), randomize=False)
_, share_url, __ = annotator.create_interface().launch()

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.


In the interface, you can choose the voice and then press the buttons to annotate the samples. Some buttons have a special meaning to the algorithm and must be created with the labels defined in `Annotator.Feedback`.

The following code tests the created interface as part of the demo's integration tests. You should remove it when working with this code.

In [5]:
from gradio_client import Client
client = Client(share_url)
client.predict(voice="Lina", api_name="/cm_voice_change")
client.predict(api_name="/cm_next_sample")[3]['label']
id1 = annotator._current_id
client.predict('Yes',api_name="/cm_feedback")
client.predict(api_name="/cm_next_sample")[3]['label']
id2 = annotator._current_id
client.predict('No', api_name='/cm_feedback')

annotation_data = Annotator.read_annotation_file('temp/annotation.txt')
annotation_data = {rec['id']:rec['feedback'] for rec in annotation_data}
assert annotation_data[id1] == 'Yes'
assert annotation_data[id2] == 'No'

Loaded as API: http://127.0.0.1:7860/ ✔


You may now use the annotation data to filter the records. After (or instead of) that, you may export the records in the format supported by `PiperTraining`.

Note: `PiperTraining` does support multi-voice models. However, in my experience their quality is poorer than that of several independent models. Thus, I would suggest creating one model per character, and for that you need to create a dataset for each character:
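
A minimal sketch of such filtering, assuming the annotation data is a dictionary from record id to feedback label (as built in the test cell above) and that each record exposes an `id` attribute. `DemoRecord` and `filter_accepted` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class DemoRecord:
    # Hypothetical stand-in for an exported record; only these fields are used here.
    id: str
    character: str


def filter_accepted(records, annotation_data, accepted_label='Yes'):
    """Keep only the records whose id was annotated with the accepted label."""
    return [rec for rec in records if annotation_data.get(rec.id) == accepted_label]


demo_records = [DemoRecord('a', 'Lina'), DemoRecord('b', 'Lina'), DemoRecord('c', 'Axe')]
demo_annotation = {'a': 'Yes', 'b': 'No'}
print(filter_accepted(demo_records, demo_annotation))
```

Note that in this sketch records without any annotation are dropped as well; whether that is desirable depends on how much of the data was annotated.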

In [6]:
characters = set(rec.character for rec in records)
for character in characters:
    exporter.export_to_zip([rec for rec in records if rec.character==character], f'temp/dataset_{character}.zip')

In [7]:
assert Path('temp/dataset_Axe.zip').is_file()
assert Path('temp/dataset_Lina.zip').is_file()

After this, the following steps are required:

**Generate enough data**

What counts as "enough" is debatable. With Tortoise and manual annotation I was able to train a good voice on as little as 27 minutes. But in that case the samples were hand-picked, which took a lot of annotation time, and I wouldn't want to repeat this experience.

For Zonos I don't need to annotate, but that means some failures make it into the dataset even after all the procedures with Vosk (e.g. coughing in the middle of a sentence). The quality is lower, and thus more data is required for training. 40 minutes were not enough, 60 minutes were enough with minor glitches, and 90 minutes were okay (for German). With too little data, there is noise and "sand" in the resulting audio, probably coming from the rare cases when Zonos failed to produce the voice properly. A bigger dataset definitely helps with this problem.
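
To check how far a dataset is from these figures, the total duration per character can be converted to minutes. The sketch below assumes that each record has `character` and `duration` attributes and that `duration` is measured in seconds; the function name and the demo values are made up.

```python
from collections import defaultdict
from types import SimpleNamespace


def minutes_per_character(records):
    """Sum the record durations (assumed to be in seconds) per character, in minutes."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec.character] += rec.duration
    return {character: seconds / 60 for character, seconds in totals.items()}


# Stand-in records; in the notebook these would be the exported records.
demo_records = [
    SimpleNamespace(character='Axe', duration=90.0),
    SimpleNamespace(character='Axe', duration=30.0),
    SimpleNamespace(character='Lina', duration=60.0),
]
for character, minutes in minutes_per_character(demo_records).items():
    print(f'{character}: {minutes:.1f} min')
```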

**Run the training**

Check out the `PiperTraining` self-test to see how to do it. You will need to set up the `TrainingSettings`, where:

* `language` should be set using espeak notation
* `base_model` must be downloaded first; see `PiperTrainingSettings` for details. The checkpoints are available at https://huggingface.co/datasets/rhasspy/piper-checkpoints/
* `batch_size` should be small enough that you don't run out of memory
* `max_epochs` should be the sum of the number of epochs the original model was trained for (available in the model's filename) and the number of epochs you want to train it further. For a 60-minute dataset, the epoch count was 1500, and the quality was oscillating toward the end.
* If `keep_intermediate` is set to True, the intermediate checkpoints are kept in addition to the final one, so you will be able to pick the best one. I recommend turning this option on, and also adjusting `checkpoint_epochs` to keep the disk from filling up.
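
The `max_epochs` arithmetic can be sketched as follows. The sketch assumes that the checkpoint filename contains an `epoch=<number>` fragment; the filename below and the number of additional epochs are only examples, and `compute_max_epochs` is a hypothetical helper, not part of `PiperTraining`.

```python
import re


def compute_max_epochs(checkpoint_filename: str, additional_epochs: int) -> int:
    """Add the desired number of epochs to the epoch count found in the filename."""
    match = re.search(r'epoch=(\d+)', checkpoint_filename)
    if match is None:
        raise ValueError(f'No epoch count found in {checkpoint_filename!r}')
    return int(match.group(1)) + additional_epochs


# Example filename following the assumed pattern; 1000 is an arbitrary illustration.
print(compute_max_epochs('epoch=2164-step=1355540.ckpt', 1000))  # 3164
```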

The default `PiperTraining` usage creates a pair of tasks, training and exporting, and if left alone they will perform the whole procedure. If you want to cancel the training or train for more steps, you can always do so by creating the training task manually. After such a sequence of trainings, a manually created export task will convert the `ckpt` files to `onnx`, the format required by Piper.

**Pick the best model**

In my observations, longer training did not always give a better result. At the late stages, the quality oscillates, probably because the model is walking in cycles.

