# Upsampling

Everyone makes mistakes, and TTS makes a lot of them.

The intuition is that all of those need to be removed from the training set. I didn't check this experimentally with the current setup, but from prior attempts I know that cutting sentences short leaves VITS training unable to read some trailing dull sounds like "g" or "b" at all, and I had to build workarounds for Tortoise to make sure it reads them. Piper also uses VITS for training, so this is probably still the case. However, I didn't research to what extent problems of this kind affect the training.

So, to remove these and other errors, I implemented several steps from `brainbox.flow`, which I will explain in this demo.

First, let's install the required deciders. Instead of Zonos, we will use CoquiTTS, because it's a bit faster and also less demanding on resources and CUDA drivers.

In [1]:
from brainbox import BrainBox
from brainbox.deciders import CoquiTTS, Vosk, Resemblyzer

api = BrainBox.Api('127.0.0.1:8090')

for decider in [CoquiTTS, Vosk, Resemblyzer]:
    api.controller_api.install_if_not_installed(decider)

Now, let's define the steps. The steps refine an array of `UpsamplingItem` objects by filling in their fields. We start with the items that represent the text dataset we upsample. To each sentence, I add a prefix and a suffix: Zonos and TortoiseTTS tend to fail exactly at the beginning and at the end of sentences, so this trick improves the success rate of the upsampling.

In [2]:
from yo_fluq import *
from brainbox.flow import ConstantStep
from chara.voice_clone.upsampling import UpsamplingItem

sentences = FileIO.read_json('../files/sentences.json')[:10]
initial_step = ConstantStep([UpsamplingItem('The beginning. '+sentence+' The end.') for sentence in sentences])

Then, each text needs a `VoiceCloner` to convert it to sound. This is represented by the following step, which also allows you to train the voice cloner on the supplied data.

In [3]:
from chara.voice_clone import upsampling 
from chara.voice_clone.voice_cloner import Character, CoquiVoiceCloner
from chara.tools import Language
from pathlib import Path

folder = Path('../files').absolute()
characters = [
    Character('Axe', folder/'axe'),
    Character('Lina', folder/'lina')
]
cloners = [CoquiVoiceCloner(c, Language.English()) for c in characters]
for c in cloners:
    c.model_prefix = 'voice_clone_demo'

assign_voice_cloners_step =  upsampling.AssignVoiceClonersStep(cloners)
assign_voice_cloners_step.get_training_command().with_cache('temp/voice_training.json').execute(api)

[{'error': None,
  'id': 'id_f15286dcadc043bf9afad578df5a0821',
  'result': 'OK',
  'tags': {'model_name': 'voice_clone_demo_Axe',
   'upsampler': 'CoquiVoiceCloner'}},
 {'error': None,
  'id': 'id_0aab3a97bf7c47fda48fce0ae947f9a7',
  'result': 'OK',
  'tags': {'model_name': 'voice_clone_demo_Lina',
   'upsampler': 'CoquiVoiceCloner'}}]

After these two steps, we have a Cartesian product of all the texts and all the cloners. It is important to exclude the sentences that were already voiced by a character. In my experience, if you skip this and leave several voiceovers of the same text by the same character, the model won't train.

In [4]:
filter_step = upsampling.FilterStep()
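
To make the idea concrete, here is a minimal sketch in plain Python of the Cartesian product followed by the filtering. It does not use `upsampling.FilterStep` and does not claim to mirror its implementation; the function `build_candidates` and the `already_voiced` set of `(character, sentence)` pairs are hypothetical names introduced only for this illustration.

```python
from itertools import product


def build_candidates(sentences, characters, already_voiced):
    """Pair every sentence with every character, skipping pairs voiced before.

    `already_voiced` is a set of (character, sentence) tuples. This data
    structure is an assumption of the sketch, not the library's format.
    """
    return [
        (character, sentence)
        for sentence, character in product(sentences, characters)
        if (character, sentence) not in already_voiced
    ]


sentences = ["Hello there.", "Good morning."]
characters = ["Axe", "Lina"]
already_voiced = {("Axe", "Hello there.")}

candidates = build_candidates(sentences, characters, already_voiced)
print(candidates)
# [('Lina', 'Hello there.'), ('Axe', 'Good morning.'), ('Lina', 'Good morning.')]
```

Each (character, text) pair appears at most once, so the training set never receives two voiceovers of the same text by the same character.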

I recommend further decreasing the number of samples processed in one iteration. This way each iteration takes less time, and you can review the results in between. The following step reduces the data to a batch, favoring the voice cloners that do not yet have enough successfully processed entries. In this notebook, we set the batch size to 10. In a real environment, set it to 100 or more.

In [5]:
from brainbox.flow import ProbabilisticEqualRepresentationStep

batching_step = ProbabilisticEqualRepresentationStep(10, lambda z: z.voice_cloner.get_voice_cloner_key(), lambda z: z.selected)
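
The exact sampling rule of `ProbabilisticEqualRepresentationStep` is not shown in this notebook. The sketch below is only one plausible way to get the described behavior: it samples a batch without replacement and weights each candidate by the inverse of the number of earlier successes in its group. The function `pick_batch`, its arguments and the weighting formula are all assumptions made for illustration.

```python
import random
from collections import Counter


def pick_batch(candidates, history, batch_size, seed=None):
    """Sample up to `batch_size` candidates, favoring under-represented groups.

    candidates: list of (group_key, payload) tuples not processed yet.
    history:    list of (group_key, was_selected) tuples from earlier runs.
    The weight 1 / (1 + successes) is an assumption of this sketch.
    """
    rng = random.Random(seed)
    successes = Counter(key for key, ok in history if ok)
    pool = list(candidates)
    batch = []
    while pool and len(batch) < batch_size:
        # Groups with fewer successful entries get proportionally higher weight.
        weights = [1.0 / (1 + successes[key]) for key, _ in pool]
        index = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        batch.append(pool.pop(index))
    return batch


candidates = [("Axe", i) for i in range(20)] + [("Lina", i) for i in range(20)]
history = [("Lina", True)] * 9 + [("Axe", False)] * 9
batch = pick_batch(candidates, history, batch_size=10, seed=0)
print(Counter(key for key, _ in batch))
```

With this history, "Lina" already has nine successes and "Axe" has none, so "Axe" candidates are drawn with a much higher probability, although the split in any single batch remains random.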

Now we can define the main parts of the process.

In [6]:
voiceover_step_factory = upsampling.VoiceoverStepFactory(api)
vosk_step_factory = upsampling.VoskStepFactory(api)

Additionally, you may want to add Resemblyzer to the system. This will help to determine how close each voiceover is to the original voices. This step may be important for TortoiseTTS.

In [7]:
resemblyzer_step_factory = upsampling.ResemblyzerStepFactory(api, 'voice_clone_demo')
resemblyzer_step_factory.get_training_command([c.samples_folder for c in characters]).with_cache('temp/resemblyzer_training.json').execute(api)

{'accuracy': None, 'stats': None}
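
How the Resemblyzer step scores closeness internally is not covered here. As a rough illustration of what "close to the original voices" can mean, the sketch below compares a speaker embedding with the average embedding of each character's reference samples using cosine similarity. The three-dimensional vectors are toy values, and `cosine_similarity`, `centroid` and `closest_voice` are hypothetical helpers, not part of `chara` or `brainbox`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def closest_voice(embedding, references):
    """Return the best-matching voice name and all similarity scores.

    references: {voice name: list of embeddings of the original samples}.
    """
    scores = {
        name: cosine_similarity(embedding, centroid(vectors))
        for name, vectors in references.items()
    }
    return max(scores, key=scores.get), scores


# Toy embeddings; real speaker embeddings have far more dimensions.
references = {
    "Axe": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "Lina": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
best, scores = closest_voice([0.85, 0.15, 0.05], references)
print(best, scores)
```

A voiceover whose best match is not the intended character is a candidate for rejection.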

Finally, we need to select the successful voiceovers. This is done by the following step, which computes the Levenshtein distance between the standardized text and the Vosk recognition. I will set the max allowed distance to 0, which is very strict, but acceptable for English.

In [8]:
selection_step = upsampling.VoskStatisticsStep(upsampling.StringDistance(), 0)
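
For readers unfamiliar with the metric, here is a self-contained sketch of this selection rule. It is not the code behind `upsampling.StringDistance`: the `standardize` rule (lowercasing and dropping punctuation) and the choice of a character-level distance are assumptions made for the example.

```python
import re


def standardize(text):
    """Lowercase and drop punctuation (an assumed normalization)."""
    return " ".join(re.sub(r"[^\w\s']", " ", text.lower()).split())


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def is_selected(expected_text, recognized_text, max_distance=0):
    """Accept a voiceover if the recognition is close enough to the text."""
    distance = levenshtein(standardize(expected_text), standardize(recognized_text))
    return distance <= max_distance


expected = "The beginning. Hello, world! The end."
print(is_selected(expected, "the beginning hello world the end"))  # True
print(is_selected(expected, "the beginning hello word the end"))   # False
```

With a max distance of 0, a single dropped letter in the recognition is enough to reject the voiceover.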

That's it. Now we can run the whole thing:

In [9]:
from brainbox.flow import Flow

flow = Flow(
    'temp/upsampling',
    [
        initial_step,
        assign_voice_cloners_step,
        filter_step,
        batching_step,
        voiceover_step_factory.create_step().with_name('voiceover'),
        vosk_step_factory.create_step().with_name('vosk'),
        resemblyzer_step_factory.create_step().with_name('resemblyzer'),
        selection_step
    ])
flow.reset()
flow.run(1)

TOTAL 0
0 records -> Start ConstantStep, 0/8 -> 10 records
10 records -> Start AssignVoiceClonersStep, 1/8 -> 20 records
20 records -> Start FilterStep, 2/8 -> 20 records
20 records -> Start ProbabilisticEqualRepresentationStep, 3/8 -> 10 records (Axe: 6, Lina: 4)
10 records -> Start voiceover, 4/8 -> 10 records
10 records -> Start vosk, 5/8 -> 10 records
10 records -> Start resemblyzer, 6/8 -> 10 records
10 records -> Start VoskStatisticsStep, 7/8 -> 10 records (selected 5/10, 50% success rate)
TOTAL 10


In [10]:
data = flow.read_flatten()
df = UpsamplingItem.to_df(data)
df.groupby('character').selected.mean()

character
Axe     0.166667
Lina    1.000000
Name: selected, dtype: float64

The following code creates an upsampling example for the demonstration of the next step. Change `if False` to `if True` if you want to update the samples.

In [11]:
if False:
    df = df.loc[df.selected].feed(fluq.add_ordering_column('character','duration'))
    df = df.loc[df.order<2]
    for item in df.file:
        api.download(item, f'../files/upsampling_example/wavs/{item}')
    data = [d for d in flow.read_flatten() if d.voiceover_file in set(df.file)]
    FileIO.write_pickle(data, '../files/upsampling_example/records.pkl')

# Additional notes

For Zonos, the following types of mistakes are common:
1. Reading random rubbish instead of the given text.
2. Adding coughing, yawning and all sorts of noise at the beginning.
3. Cutting the sentence short by 1-2 syllables.

These problems are solved by the Vosk recognition and by adding prefixes and suffixes to the texts. Fortunately, Zonos almost always picks the voice of the character correctly, so the Resemblyzer step is effectively useless.

However, in my experience, TortoiseTTS still picks up the intonations and the character of the voice much better than Zonos. When I train on TortoiseTTS upsampling, the resulting voices are much more vivid. TortoiseTTS, however, often fails to pick the voice correctly, which is why Resemblyzer may still be useful. Also, with TortoiseTTS, manual annotation of the results is needed, which is also going to be covered in the next session.

## Transferring voice through language

Sometimes we want to train, e.g., a German voice from samples of an English voice. This can be done! However, if English-trained Zonos produces German voiceovers, the following mistakes will be quite common:

* The absence of glottal stops. This makes German completely incomprehensible.
* The English pronunciation of words like "er" or "dir", and generally the English "r" everywhere.


If you train the model on this output, the result is going to be horrible. I think this also holds for other combinations of languages, not only for English->German.

To avoid this, first use the English-trained Zonos to produce ~10 minutes of German voiceovers. It's going to take a long time, as the success rate will be quite low. Then, annotate these voiceovers and pick those that sound German enough to you. They don't have to be absolutely perfect, just better than average.

Then, use these annotated voiceovers as samples and produce the final German upsampling. The success rate is going to increase, and the output is going to sound much better. Using Zonos-generated voiceovers as the training data for Zonos decreases the quality, but not to the point where the voices become unrecognizable.

## Vosk parameters with German language

I recommend setting a transformer for `StringDistance` that replaces `ß` with `ss`: these two sound exactly the same, so Vosk cannot tell which spelling was intended, and a distance of 0 would simply be impossible.

Additionally, German has declensions, and words like "jede" and "jeder" are sometimes impossible to distinguish by ear, especially if spoken briefly. So I'd recommend increasing the max allowed distance to 2 or 3.
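
Both recommendations can be illustrated with a small standalone sketch. It does not show how to configure `StringDistance` itself, since that interface is not described here; `normalize_german`, `edit_distance` and `matches` are hypothetical helpers, and the character-level distance is an assumption.

```python
def normalize_german(text):
    """Lowercase and unify the two spellings of the same sound."""
    return text.lower().replace("ß", "ss")


def edit_distance(a, b):
    """Character-level Levenshtein distance, computed row by row."""
    row = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        new_row = [i]
        for j, char_b in enumerate(b, start=1):
            new_row.append(min(
                row[j] + 1,                      # deletion
                new_row[j - 1] + 1,              # insertion
                row[j - 1] + (char_a != char_b)  # substitution or match
            ))
        row = new_row
    return row[-1]


def matches(expected, recognized, max_distance):
    """Compare normalized texts against the allowed distance."""
    distance = edit_distance(normalize_german(expected), normalize_german(recognized))
    return distance <= max_distance


# The spelling difference disappears after normalization.
print(matches("Die Straße ist lang", "die strasse ist lang", max_distance=0))  # True
# A dropped declension ending costs one edit.
print(matches("jeder Tag", "jede tag", max_distance=0))  # False
print(matches("jeder Tag", "jede tag", max_distance=2))  # True
```

A threshold of 2 or 3 tolerates one or two such endings per sentence, while still rejecting voiceovers where whole words are missing or misread.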