The second and third lines set up the phonemizer front end, which helps convert raw text (graphemes) into phonetic symbols (phones).

In [1]:
from IPython.display import Audio, display

# -y answers the confirmation prompt, which cannot be typed from a notebook cell
!apt-get install -y festival espeak-ng mbrola
%pip install -q phonemizer

E: Could not open lock file /var/lib/dpkg/lock-frontend - open (13: Permission denied)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), are you root?
Note: you may need to restart the kernel to use updated packages.


## Let's test the installed front end

The output of the next cell should be:
`həloʊ wɜːld wɛlkʌm tə ðə spiːtʃ sɪnθɪsɪs tuːtoːɹɪəl`

In [2]:
# Update the text here
text = "Hello world. Welcome to the speech synthesis tutorial."

!echo '{text}' | phonemize

fatal error: espeak not installed on your system
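
The `fatal error` above means the phonemizer could not find an espeak binary, which usually indicates that the `apt-get` step did not complete. A minimal sketch for checking this from Python follows. It only looks for the executables on `PATH` and assumes they are called `espeak-ng` or `espeak`; the helper name `find_espeak` is made up for this example.

```python
import shutil

def find_espeak(candidates=("espeak-ng", "espeak")):
    """Return the path of the first espeak executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None

espeak_path = find_espeak()
if espeak_path is None:
    # The apt-get errors above ask "are you root?", so the install needs root access.
    print("No espeak executable found; rerun the install cell with root access.")
else:
    print(f"Found espeak at {espeak_path}")
```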


# Text-to-Speech using pretrained models

Let's play with [Matcha-TTS](https://arxiv.org/abs/2309.03199) to generate audio

Installation may take some time, and you might need to restart the session once it finishes. Subsequent runs will be much faster.

In [None]:
%pip install matcha-tts

### It will download checkpoints when you run it the first time

Location: `/root/.local/share/matcha_tts/`

Current working dir: `/content`

In [None]:
%cd /content/
%pwd

In [None]:
# It will download checkpoints when you run it the first time
!matcha-tts --text "Hello world. Welcome to the speech synthesis tutorial."

In [None]:
display(Audio('utterance_001.wav'))

## Multispeaker synthesis

In [None]:
!matcha-tts --text "Hello world. Welcome to the speech synthesis tutorial." --model matcha_vctk --spk 0
display(Audio('utterance_001_speaker_000.wav'))

In [None]:
!matcha-tts --text "Hello world. Welcome to the speech synthesis tutorial." --model matcha_vctk --spk 16
display(Audio('utterance_001_speaker_016.wav'))
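
The file names used in the cells above suggest a fixed naming pattern: a zero-padded utterance index, followed by a zero-padded speaker id when `--spk` is given. The helper below is a small sketch of that pattern so you can build the name for any speaker. The pattern is inferred from the names in this notebook, not taken from the Matcha-TTS documentation, and `output_name` is a hypothetical helper.

```python
def output_name(index, speaker=None):
    """Build the expected output file name for one synthesised utterance."""
    name = f"utterance_{index:03d}"
    if speaker is not None:
        # Multispeaker runs appear to append the speaker id, padded to three digits.
        name += f"_speaker_{speaker:03d}"
    return name + ".wav"

print(output_name(1))      # utterance_001.wav
print(output_name(1, 16))  # utterance_001_speaker_016.wav
```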

# Fine-tuning to your own voice

## Clone Matcha-TTS and install it in editable mode

`pip install -e <package>` enables editable mode. It is useful when installing a package locally, most often one you are developing on your own system. Instead of copying the package, it links to the original location, so any change to the source is reflected directly in your environment.

We clone the repository with git, enter the directory, and install it in editable mode, so any changes we make take effect immediately.

After a restart, jump directly to this point.

In [None]:
!git clone https://github.com/shivammehta25/Matcha-TTS.git
%cd Matcha-TTS
!pip install -e .
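
To confirm that the editable install points at the clone, you can ask Python where it would import the package from. This is a minimal sketch: it assumes the import name is `matcha` (based on the `matcha/` folder in the repository), and `package_location` is a hypothetical helper. If the printed path lies inside the cloned `Matcha-TTS` folder, edits to the clone should take effect.

```python
import importlib.util

def package_location(name):
    """Return the file a top-level package would be imported from, or None if it is not installed."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin

# "matcha" is an assumed import name; adjust it if the package is named differently.
print(package_location("matcha"))
```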

## Preprocess dataset to make it ready to train

Create a `data` directory with subdirectories so that the wavs can be saved to `data/MyDataset/wavs/`.

In [None]:
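# Start from a clean data directory; -rf removes a directory and does not fail if it is missing
!rm -rf data
# -p creates the parent directories in one step
!mkdir -p data/MyDataset/wavs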

Using audio recording software, record one file per sentence in the list below.

Each row of the file list has the format:
`location|speaker id|text`

```
data/MyDataset/wavs/1.wav|0|The skipper's and Nakata's gymnastics served as a translation without words.
data/MyDataset/wavs/2.wav|0|Knowing him, I review the old Scandinavian myths with clearer understanding.
data/MyDataset/wavs/3.wav|0|He told himself that as he washed himself and groomed his disheveled clothes.
data/MyDataset/wavs/4.wav|0|The river bared its bosom, and snorting steamboats challenged the wilderness.
data/MyDataset/wavs/5.wav|0|Once the jews harp began emitting its barbaric rhythms, Michael was helpless.
data/MyDataset/wavs/6.wav|0|He had observed the business life of Hawaii and developed a vaulting ambition.
data/MyDataset/wavs/7.wav|0|The Fire-Men wore animal skins around their waists and across their shoulders.
data/MyDataset/wavs/8.wav|0|The temperature dropped to fifty below zero and remained there the whole trip.
data/MyDataset/wavs/9.wav|0|The sunsets grow more bizarre and spectacular off this coast of the Argentine.
data/MyDataset/wavs/10.wav|0|He also contended that better confidence was established by carrying no weapons.
data/MyDataset/wavs/11.wav|0|A wildly exciting time was his during the week preceding Thursday the eighteenth.
data/MyDataset/wavs/12.wav|0|In short, my joyous individualism was dominated by the orthodox bourgeois ethics.
data/MyDataset/wavs/13.wav|0|The Japanese understood as we could never school ourselves or hope to understand.
data/MyDataset/wavs/14.wav|0|The hunters were still arguing and roaring like some semi-human amphibious breed.
data/MyDataset/wavs/15.wav|0|Of course much grumbling went on, and little outbursts were continually occurring.
data/MyDataset/wavs/16.wav|0|Down through the perfume-weighted air fluttered the snowy fluffs of the cottonwoods.
data/MyDataset/wavs/17.wav|0|The butchers and meat-cutters refused to handle meat destined for unfair restaurants.
data/MyDataset/wavs/18.wav|0|Mercedes screamed cried, laughed, and manifested the chaotic abandonment of hysteria.
data/MyDataset/wavs/19.wav|0|I also understand that similar branch organizations have made their appearance in Europe.
data/MyDataset/wavs/20.wav|0|A combination of Canadian capital quickly organized and petitioned for the same privileges.
```
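
The `location|speaker id|text` format can be checked with a few lines of Python before training. The parser below is only an illustrative sketch of the row layout; it is not how Matcha-TTS reads the file internally, and `parse_filelist_line` is a hypothetical helper.

```python
def parse_filelist_line(line):
    """Split one 'location|speaker id|text' row into its three fields."""
    # maxsplit=2 keeps any further '|' characters inside the text field.
    parts = line.rstrip("\n").split("|", 2)
    if len(parts) != 3:
        raise ValueError(f"expected 3 '|'-separated fields, got {len(parts)}: {line!r}")
    location, speaker_id, text = parts
    return location, int(speaker_id), text

row = "data/MyDataset/wavs/1.wav|0|The skipper's and Nakata's gymnastics served as a translation without words."
location, speaker_id, text = parse_filelist_line(row)
print(location, speaker_id, text)
```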


We will override speaker `0` with our own audio.

Record the audio and move all files to `data/MyDataset/wavs`. Every wav must have a sample rate of 22050 Hz; files with a different rate are resampled in a later step.

The quality of the final output will depend highly on the training data. Therefore you should make sure to make your recordings as clear as possible. Here are two guides that can be useful: [here](https://speech.zone/exercises/build-a-unit-selection-voice/make-the-recordings/) and [here](https://speech.zone/exercises/build-a-unit-selection-voice/make-the-recordings/create-a-studio-at-home/).

If you have trouble finding a good recording program, you could use a browser-based recorder like [this one](https://online-voice-recorder.com/).



Make sure that the files are named `1.wav` ... `20.wav`

In [None]:
from google.colab import files
import shutil
import os

# Upload multiple files
uploaded = files.upload()

# Define the target directory
target_directory = 'data/MyDataset/wavs'

# Move the uploaded files to the target directory
for filename, file_content in uploaded.items():
    target_path = os.path.join(target_directory, filename)
    with open(target_path, 'wb') as f:
        f.write(file_content)
    print(f'{filename} moved to {target_path}')

# Removing the files from the base folder
!rm [0-9]*.wav

In [None]:
!ls data/MyDataset/wavs

You should see your `.wav` files listed above.
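
Rather than reading the listing by eye, you could check for the expected names `1.wav` ... `20.wav` programmatically. This is a small sketch; `missing_recordings` is a hypothetical helper and the directory is the one used in this notebook.

```python
from pathlib import Path

def missing_recordings(wav_dir, count=20):
    """Return the expected file names (1.wav ... count.wav) that are absent from wav_dir."""
    wav_dir = Path(wav_dir)
    expected = [f"{i}.wav" for i in range(1, count + 1)]
    return [name for name in expected if not (wav_dir / name).is_file()]

missing = missing_recordings("data/MyDataset/wavs")
print("Missing files:", missing if missing else "none")
```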

In [None]:
%%capture
%cd data/MyDataset/wavs
# Fix the sample rate to match that of the pretrained vocoder
!for file in *.wav; do ffmpeg -i "$file" -ar 22050 -ac 1 "temp_${file}" && mv "temp_${file}" "$file"; done
%cd ../../../
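
Because the cell above runs under `%%capture`, any ffmpeg error is hidden. As a sanity check, the sketch below reads the sample rate and channel count back from each file with Python's standard `wave` module. Note that `wave` only reads PCM WAV files and raises `wave.Error` otherwise; `wav_info` and `is_training_ready` are hypothetical helpers.

```python
import wave
from pathlib import Path

def wav_info(path):
    """Return (sample_rate, channels) read from a PCM WAV file header."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getframerate(), wav_file.getnchannels()

def is_training_ready(path, sample_rate=22050, channels=1):
    """True if the file has the sample rate and channel count requested from ffmpeg."""
    return wav_info(path) == (sample_rate, channels)

# Prints nothing if the directory is empty or does not exist yet.
for wav_path in sorted(Path("data/MyDataset/wavs").glob("*.wav")):
    print(wav_path.name, wav_info(wav_path), is_training_ready(wav_path))
```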

Save the file list as both `train.txt` and `val.txt`. Reusing the same data is acceptable here because we are not evaluating the model; normally you would want two separate sets.
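
For a larger dataset where a separate validation set matters, a held-out split could be sketched as below. This is not part of this tutorial's workflow, which deliberately reuses all 20 rows in both files; `split_filelist`, the 10% fraction, and the placeholder rows are assumptions for illustration.

```python
import random

def split_filelist(lines, val_fraction=0.1, seed=0):
    """Shuffle the rows reproducibly and hold out a fraction for validation."""
    rows = [line for line in lines if line.strip()]
    rng = random.Random(seed)  # fixed seed so the split is the same on every run
    rng.shuffle(rows)
    n_val = max(1, round(len(rows) * val_fraction))
    return rows[n_val:], rows[:n_val]

# Placeholder rows in the location|speaker id|text format.
example_rows = [f"data/MyDataset/wavs/{i}.wav|0|sentence {i}" for i in range(1, 21)]
train_rows, val_rows = split_filelist(example_rows)
print(len(train_rows), len(val_rows))  # 18 2
```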

In [None]:
%%writefile data/MyDataset/train.txt
data/MyDataset/wavs/1.wav|0|The skipper's and Nakata's gymnastics served as a translation without words.
data/MyDataset/wavs/2.wav|0|Knowing him, I review the old Scandinavian myths with clearer understanding.
data/MyDataset/wavs/3.wav|0|He told himself that as he washed himself and groomed his disheveled clothes.
data/MyDataset/wavs/4.wav|0|The river bared its bosom, and snorting steamboats challenged the wilderness.
data/MyDataset/wavs/5.wav|0|Once the jews harp began emitting its barbaric rhythms, Michael was helpless.
data/MyDataset/wavs/6.wav|0|He had observed the business life of Hawaii and developed a vaulting ambition.
data/MyDataset/wavs/7.wav|0|The Fire-Men wore animal skins around their waists and across their shoulders.
data/MyDataset/wavs/8.wav|0|The temperature dropped to fifty below zero and remained there the whole trip.
data/MyDataset/wavs/9.wav|0|The sunsets grow more bizarre and spectacular off this coast of the Argentine.
data/MyDataset/wavs/10.wav|0|He also contended that better confidence was established by carrying no weapons.
data/MyDataset/wavs/11.wav|0|A wildly exciting time was his during the week preceding Thursday the eighteenth.
data/MyDataset/wavs/12.wav|0|In short, my joyous individualism was dominated by the orthodox bourgeois ethics.
data/MyDataset/wavs/13.wav|0|The Japanese understood as we could never school ourselves or hope to understand.
data/MyDataset/wavs/14.wav|0|The hunters were still arguing and roaring like some semi-human amphibious breed.
data/MyDataset/wavs/15.wav|0|Of course much grumbling went on, and little outbursts were continually occurring.
data/MyDataset/wavs/16.wav|0|Down through the perfume-weighted air fluttered the snowy fluffs of the cottonwoods.
data/MyDataset/wavs/17.wav|0|The butchers and meat-cutters refused to handle meat destined for unfair restaurants.
data/MyDataset/wavs/18.wav|0|Mercedes screamed cried, laughed, and manifested the chaotic abandonment of hysteria.
data/MyDataset/wavs/19.wav|0|I also understand that similar branch organizations have made their appearance in Europe.
data/MyDataset/wavs/20.wav|0|A combination of Canadian capital quickly organized and petitioned for the same privileges.

In [None]:
%%writefile data/MyDataset/val.txt
data/MyDataset/wavs/1.wav|0|The skipper's and Nakata's gymnastics served as a translation without words.
data/MyDataset/wavs/2.wav|0|Knowing him, I review the old Scandinavian myths with clearer understanding.
data/MyDataset/wavs/3.wav|0|He told himself that as he washed himself and groomed his disheveled clothes.
data/MyDataset/wavs/4.wav|0|The river bared its bosom, and snorting steamboats challenged the wilderness.
data/MyDataset/wavs/5.wav|0|Once the jews harp began emitting its barbaric rhythms, Michael was helpless.
data/MyDataset/wavs/6.wav|0|He had observed the business life of Hawaii and developed a vaulting ambition.
data/MyDataset/wavs/7.wav|0|The Fire-Men wore animal skins around their waists and across their shoulders.
data/MyDataset/wavs/8.wav|0|The temperature dropped to fifty below zero and remained there the whole trip.
data/MyDataset/wavs/9.wav|0|The sunsets grow more bizarre and spectacular off this coast of the Argentine.
data/MyDataset/wavs/10.wav|0|He also contended that better confidence was established by carrying no weapons.
data/MyDataset/wavs/11.wav|0|A wildly exciting time was his during the week preceding Thursday the eighteenth.
data/MyDataset/wavs/12.wav|0|In short, my joyous individualism was dominated by the orthodox bourgeois ethics.
data/MyDataset/wavs/13.wav|0|The Japanese understood as we could never school ourselves or hope to understand.
data/MyDataset/wavs/14.wav|0|The hunters were still arguing and roaring like some semi-human amphibious breed.
data/MyDataset/wavs/15.wav|0|Of course much grumbling went on, and little outbursts were continually occurring.
data/MyDataset/wavs/16.wav|0|Down through the perfume-weighted air fluttered the snowy fluffs of the cottonwoods.
data/MyDataset/wavs/17.wav|0|The butchers and meat-cutters refused to handle meat destined for unfair restaurants.
data/MyDataset/wavs/18.wav|0|Mercedes screamed cried, laughed, and manifested the chaotic abandonment of hysteria.
data/MyDataset/wavs/19.wav|0|I also understand that similar branch organizations have made their appearance in Europe.
data/MyDataset/wavs/20.wav|0|A combination of Canadian capital quickly organized and petitioned for the same privileges.

## Create configuration files

In [None]:
%%writefile configs/data/my_dataset.yaml
defaults:
  - vctk.yaml
  - _self_

name: MyDataset
train_filelist_path: data/MyDataset/train.txt
valid_filelist_path: data/MyDataset/val.txt

In [None]:
%%writefile configs/experiment/my_audio.yaml
# @package _global_

# to execute this experiment run:
# python train.py experiment=my_audio

defaults:
  - override /data: my_dataset.yaml

# all parameters below will be merged with parameters from default configurations set above
# this allows you to overwrite only specified parameters

tags: ["finetuning"]

run_name: MyDataset


data:
  batch_size: 5

trainer:
  check_val_every_n_epoch: 10
  limit_val_batches: 2
  max_epochs: 2063
  # The base model we fine-tune was trained for 1863 epochs, so we train for an additional 200 epochs
  # It will be very fast since our dataset has only 20 sentences

In [None]:
# Download base model
!wget https://github.com/shivammehta25/Matcha-TTS-checkpoints/releases/download/v1.0/matcha_vctk.ckpt

### Let's train it for a further 200 epochs

It should take 15-30 minutes.

Be aware that if you are doing this on Google Colab you might run out of compute!
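
To see how much training this actually involves, the arithmetic behind `max_epochs` can be worked out as follows. The numbers come from the configuration above; the step count assumes one optimizer step per batch on a single device with no gradient accumulation, which is an assumption rather than something the config states.

```python
import math

base_epochs = 1863   # epochs the base model was trained for (from the config comment)
extra_epochs = 200   # additional fine-tuning epochs
num_utterances = 20  # rows in train.txt
batch_size = 5       # data.batch_size in my_audio.yaml

max_epochs = base_epochs + extra_epochs
steps_per_epoch = math.ceil(num_utterances / batch_size)
total_steps = steps_per_epoch * extra_epochs

print(max_epochs, steps_per_epoch, total_steps)  # 2063 4 800
```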


In [None]:
!python matcha/train.py ckpt_path=matcha_vctk.ckpt experiment=my_audio

In `logs/train/MyDataset/runs` you will find a folder named after the date and time of the run. Its `checkpoints` subfolder contains the `.ckpt` file to pass to `matcha-tts`.

In [None]:
date = !ls logs/train/MyDataset/runs
checkpoint_path = f"logs/train/MyDataset/runs/{date[0]}/checkpoints/last.ckpt" # If you have multiple models you might want to change the index in the list of dates
print(checkpoint_path)
!matcha-tts --checkpoint_path={checkpoint_path} --spk 0 --vocoder hifigan_univ_v1 --text "Test audio! This should sound like you now."
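
If you have trained more than once, `date[0]` picks the first run folder, which may not be the one you want. The sketch below picks the newest run that contains a checkpoint instead. It assumes the run folders are named by timestamp so that sorting by name orders them by time; `latest_checkpoint` is a hypothetical helper.

```python
from pathlib import Path

def latest_checkpoint(runs_dir="logs/train/MyDataset/runs", name="last.ckpt"):
    """Return the checkpoint path in the newest run folder that has one, or None."""
    runs_dir = Path(runs_dir)
    if not runs_dir.is_dir():
        return None
    # Assumption: folder names are timestamps, so the last name in sorted order is the newest.
    run_dirs = sorted(p for p in runs_dir.iterdir() if p.is_dir())
    for run_dir in reversed(run_dirs):
        candidate = run_dir / "checkpoints" / name
        if candidate.is_file():
            return candidate
    return None

print(latest_checkpoint())
```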

In [None]:
!ls
display(Audio('utterance_001_speaker_000.wav')) # If this causes an error make sure that the file that you are trying to play exists in the current folder

# Synthesising
If everything is working and the test voice sounds similar to yours, generate the first 5-10 of the [test sentences](https://www.cs.columbia.edu/~hgs/audio/harvard.html).

In [None]:
%%writefile test_sentences.txt
The birch canoe slid on the smooth planks.
Glue the sheet to the dark blue background.
It's easy to tell the depth of a well.
These days a chicken leg is a rare dish.
Rice is often served in round bowls.
The juice of lemons makes fine punch.
The box was thrown beside the parked truck.
The hogs were fed chopped corn and garbage.
Four hours of steady work faced us.
A large size in stockings is hard to sell.


In [None]:
!matcha-tts --checkpoint_path={checkpoint_path} --spk 0 --vocoder hifigan_univ_v1 --file test_sentences.txt --batched --batch_size 10

In [None]:
!ls
# Play the ten batched outputs, utterance_000 through utterance_009
for i in range(10):
    display(Audio(f'utterance_{i:03d}_speaker_000.wav'))

## Downloading the audio files

If you are happy with the sound, download the audio files so that you can hand them in with the assignment. You can do this either with the file explorer in Google Colab or by running the next code block. Make sure that the zip file you download is not empty!

In [None]:
from google.colab import files
!zip test_audio_files.zip utterance_*_speaker_000.wav
files.download('test_audio_files.zip')
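
One way to make sure the archive is not empty is to build it in Python and inspect its contents before downloading. This is a sketch using the standard `glob` and `zipfile` modules in place of the `zip` command; `zip_utterances` is a hypothetical helper, and the default pattern and archive name are the ones used in the cell above.

```python
import glob
import zipfile

def zip_utterances(pattern="utterance_*_speaker_000.wav", archive="test_audio_files.zip"):
    """Zip every file matching pattern and return the names stored in the archive."""
    wav_files = sorted(glob.glob(pattern))
    if not wav_files:
        # Fail loudly instead of producing an empty archive.
        raise FileNotFoundError(f"no files match {pattern!r}")
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for wav_file in wav_files:
            zf.write(wav_file)
    with zipfile.ZipFile(archive) as zf:
        return zf.namelist()

# Example use in the notebook:
# names = zip_utterances()
# print(len(names), "files archived")
```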


## Downloading the model

The following code block copies the checkpoint file that contains the model to your Google Drive. Because the file is large, we recommend uploading it to Drive rather than downloading it directly, so make sure you have around 500 MB available there.

In [None]:
from google.colab import drive
!pwd
mount_path = "/content/Matcha-TTS/drive"
drive.mount(mount_path)
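# Raw string: keeps the backslash that escapes the space for the shell, without an invalid-escape warning
drive_file_path = mount_path + r"/My\ Drive/myModel.ckpt"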

!cp {checkpoint_path} {drive_file_path}

Now check your Drive and make sure that a file named `myModel.ckpt` exists.