<a href="https://colab.research.google.com/github/ixxan/ug-speech/blob/main/Fine_Tune_MMS_TTS_Uyghur.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tune Meta MMS TTS for Uyghur with UQSpeech Dataset


Meta MMS TTS: https://huggingface.co/facebook/mms-tts

UQSpeech Dataset: https://github.com/gheyret/UQSpeechDataset

Finetune guide: https://github.com/ylacombe/finetune-hf-vits


In [None]:
!pip install datasets umsc pydub jiwer evaluate transformers

Collecting umsc
  Downloading umsc-0.3.0-py3-none-any.whl.metadata (4.3 kB)
Downloading umsc-0.3.0-py3-none-any.whl (22 kB)
Installing collected packages: umsc
Successfully installed umsc-0.3.0


In [None]:
import torch
if torch.cuda.is_available():
  device = "cuda"
else:
  device = 'cpu'
device

'cuda'

In [None]:
from huggingface_hub import login
from google.colab import userdata

login(token=userdata.get('HF_TOKEN'))

hf_username = "ixxan"
repo_name = "mms-tts-uig-script_arabic-UQSpeech"

## Clone Finetune VITS Env

In [None]:
# Clone the repository
!git clone https://github.com/ylacombe/finetune-hf-vits.git

# Change the working directory
%cd finetune-hf-vits

# Install requirements
!pip install -r requirements.txt

# Back to main directory
%cd ..

Cloning into 'finetune-hf-vits'...
remote: Enumerating objects: 173, done.
remote: Counting objects: 100% (58/58), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 173 (delta 49), reused 44 (delta 44), pack-reused 115 (from 1)
Receiving objects: 100% (173/173), 1.24 MiB | 17.88 MiB/s, done.
Resolving deltas: 100% (97/97), done.
/content/finetune-hf-vits
Collecting datasets>=2.14.7 (from datasets[audio]>=2.14.7->-r requirements.txt (line 2))
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets>=2.14.7->datasets[audio]>=2.14.7->-r requirements.txt (line 2))
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets>=2.14.7->datasets[audio]>=2.14.7->-r requirements.txt (line 2))
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets>=2.14.7->datasets[audio]>=2.14.7->-r requirem

## Build Cython Alignment

In [None]:
%cd finetune-hf-vits/monotonic_align
!mkdir monotonic_align
!python setup.py build_ext --inplace
%cd ..
%cd ..

/content/finetune-hf-vits/monotonic_align
Compiling core.pyx because it changed.
[1/1] Cythonizing core.pyx
  tree = Parsing.p_module(s, pxd, full_module_name)
performance hint: core.pyx:7:5: Exception check on 'maximum_path_each' will always require the GIL to be acquired.
Possible solutions:
	1. Declare 'maximum_path_each' as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
	2. Use an 'int' return type on 'maximum_path_each' to allow an error code to be returned.
performance hint: core.pyx:38:6: Exception check on 'maximum_path_c' will always require the GIL to be acquired.
Possible solutions:
	1. Declare 'maximum_path_c' as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
	2. Use an 'int' return type on 'maximum_path_c' to allow an error code to be returned.
performance hint: core.pyx:42:21: Exception check after calling 'maximum_path_each' will always require the GIL to

## Convert MMS Checkpoint

In [None]:
# %cd <path-to-finetune-hf-vits-repo>
# !python convert_original_discriminator_checkpoint.py \
#     --language_code <language-code> \
#     --pytorch_dump_folder_path <local-folder> \
#     --push_to_hub <repo-id-you-want>

In [None]:
%cd finetune-hf-vits
!python convert_original_discriminator_checkpoint.py \
    --language_code uig-script_arabic \
    --pytorch_dump_folder_path "./mms-uig-script_arabic-checkpoint" \
    --push_to_hub "{hf_username}/{repo_name}"
%cd ..

/content/finetune-hf-vits
2024-12-25 01:42:18.347238: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-25 01:42:18.366905: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-25 01:42:18.372922: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-25 01:42:18.387745: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
D_100000.pth: 100% 561M/561

## Prepare UQSpeech Data
The cells below show how I preprocessed UQSpeech. The preprocessed data is already available on Hugging Face (https://huggingface.co/datasets/ixxan/mms-tts-uig-script_arabic-UQSpeech), so you can skip this step; it is included for reference only.

In [None]:
# Download UQSpeech (Uyghur Quran reading, split into small files)
!gdown "https://drive.google.com/uc?id=1sqcMf0Gl5FEiURQCQAV1SWW4R4f_VQt2"
!7z x UQSpeech.7z

Downloading...
From (original): https://drive.google.com/uc?id=1sqcMf0Gl5FEiURQCQAV1SWW4R4f_VQt2
From (redirected): https://drive.google.com/uc?id=1sqcMf0Gl5FEiURQCQAV1SWW4R4f_VQt2&confirm=t&uuid=3166ad6c-20a6-4949-a1ed-c75078d02675
To: /content/UQSpeech.7z
100% 2.57G/2.57G [00:43<00:00, 59.0MB/s]

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,12 CPUs Intel(R) Xeon(R) CPU @ 2.20GHz (50657),ASM,AES-NI)

Scanning the drive for archives:
  0M Scan         1 file, 2570754419 bytes (2452 MiB)

Extracting archive: UQSpeech.7z
--
Path = UQSpeech.7z
Type = 7z
Physical Size = 2570754419
Headers Size = 151426
Method = Delta LZMA2:24
Solid = +
Blocks = 4


In [None]:
import os
import pandas as pd

# Read metadata
UQSpeech_csv_path = 'UQSpeech/metadata.csv'
UQSpeech_df = pd.read_csv(UQSpeech_csv_path, delimiter='|', header=None)
UQSpeech_df = UQSpeech_df.iloc[:, [0, 1]] # Keep only the file name and the Arabic-script sentence
UQSpeech_df.columns = ["path", "sentence"]
UQSpeech_df['path'] = 'UQSpeech/wavs/' + UQSpeech_df['path'] + '.wav'
UQSpeech_df

In [None]:
# Convert to Dataset object
from datasets import Dataset, Audio

UQSpeech = Dataset.from_pandas(UQSpeech_df)
UQSpeech

In [None]:
print(UQSpeech[1])
print(UQSpeech[4])

In [None]:
# Remove punctuation from the training set
import string

def remove_punctuation(batch):
    extra_punctuation = "–؛;،؟?«»‹›−—¬”“"  # Add your additional custom punctuation from the training set here
    all_punctuation = string.punctuation + extra_punctuation

    translator = str.maketrans('', '', all_punctuation)
    batch["sentence"] = batch["sentence"].translate(translator)
    return batch

UQSpeech = UQSpeech.map(remove_punctuation, num_proc=1)

print(UQSpeech[1])
print(UQSpeech[4])
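
The translation-table approach used above can be tried on a plain string before mapping it over the whole dataset. This is a minimal sketch: the character list is copied from the cell above, while `strip_punctuation` and the sample sentence are hypothetical names and data introduced only for illustration.

```python
import string

# Same extra characters as in the cell above; extend this for your own data.
EXTRA_PUNCTUATION = "–؛;،؟?«»‹›−—¬”“"

# Map every ASCII and extra punctuation character to None (i.e. delete it).
_TRANSLATOR = str.maketrans("", "", string.punctuation + EXTRA_PUNCTUATION)


def strip_punctuation(text: str) -> str:
    """Return text with all listed punctuation characters deleted."""
    return text.translate(_TRANSLATOR)


# Hypothetical sample sentence with an Arabic comma and an ASCII "!".
sample = "سالام" + "، " + "دۇنيا" + "!"
print(strip_punctuation(sample))

# Deleting a character does not collapse the whitespace around it.
print(repr(strip_punctuation("a - b")))
```

Because characters are deleted rather than replaced, a free-standing dash leaves two adjacent spaces behind. If that matters for your data, collapsing repeated whitespace afterwards is one option.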

In [None]:
# Find the unique chars from UQSpeech
unique_chars = set()
for example in UQSpeech:
    unique_chars.update(example["sentence"])

print(sorted(list(unique_chars)))
print("Characters in train:", len(unique_chars))

# Find the unique chars from MMS checkpoint
import json
with open('/content/finetune-hf-vits/mms-uig-script_arabic-checkpoint/vocab.json', 'r') as f:
    vocab_data = json.load(f)
print(sorted(list(vocab_data.keys())))
print("Characters in checkpoint vocab:", len(vocab_data.keys()))

chars_in_UQSpeech_not_in_vocab = unique_chars - vocab_data.keys()
chars_in_vocab_not_in_UQSpeech = vocab_data.keys() - unique_chars
print("Characters in train not in checkpoint vocab:", chars_in_UQSpeech_not_in_vocab)
print("Characters in checkpoint vocab not in train:", chars_in_vocab_not_in_UQSpeech)

In [None]:
# Find the troublesome records
paths = set()
for c in chars_in_UQSpeech_not_in_vocab:
  for example in UQSpeech:
    if c in example["sentence"]:
      print(set(c))
      print(example["sentence"])
      paths.add(example['path'])
paths

In [None]:
# Drop troublesome records
UQSpeech = UQSpeech.filter(lambda example: example["path"] not in paths)
UQSpeech
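
The two cells above (find the out-of-vocabulary characters, then collect the affected records) can also be done in a single pass over the data. The sketch below assumes each record is a dict with `path` and `sentence` keys, as in this notebook; `find_oov_records` and the toy records are hypothetical and stand in for the real dataset and `vocab.json`.

```python
def find_oov_records(records, vocab):
    """Return (oov_chars, paths) for records containing characters outside vocab.

    records: iterable of dicts with "path" and "sentence" keys.
    vocab:   any iterable of known characters (e.g. the keys of vocab.json).
    """
    known = set(vocab)
    oov_chars = set()
    paths = set()
    for record in records:
        # Characters of this sentence that the checkpoint vocabulary lacks.
        missing = set(record["sentence"]) - known
        if missing:
            oov_chars |= missing
            paths.add(record["path"])
    return oov_chars, paths


# Toy data standing in for the UQSpeech rows and the checkpoint vocab.
toy_records = [
    {"path": "a.wav", "sentence": "ab ba"},
    {"path": "b.wav", "sentence": "abc"},
    {"path": "c.wav", "sentence": "b a"},
]
toy_vocab = {"a": 0, "b": 1, " ": 2}

oov_chars, bad_paths = find_oov_records(toy_records, toy_vocab)
kept = [r for r in toy_records if r["path"] not in bad_paths]
print(oov_chars, bad_paths, [r["path"] for r in kept])
```

With a `datasets.Dataset`, it is probably best to run a check like this before casting the `path` column to `Audio`, since iterating over an audio column decodes each file.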

In [None]:
# Cast the audio path to audio array
UQSpeech = UQSpeech.cast_column("path", Audio(sampling_rate=16000))

In [None]:
# Optional: Push the dataset to the Hugging Face Hub
UQSpeech.push_to_hub(repo_name)

Uploading the dataset shards:   0%|          | 0/9 [00:00<?, ?it/s]

Map:   0%|          | 0/1799 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/18 [00:00<?, ?ba/s]

Map:   0%|          | 0/1798 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/18 [00:00<?, ?ba/s]

CommitInfo(commit_url='https://huggingface.co/datasets/ixxan/mms-tts-uig-script_arabic-UQSpeech/commit/584ea46910752b483645791bd94a1dbc2a5d869c', commit_message='Upload dataset', commit_description='', oid='584ea46910752b483645791bd94a1dbc2a5d869c', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/ixxan/mms-tts-uig-script_arabic-UQSpeech', endpoint='https://huggingface.co', repo_type='dataset', repo_id='ixxan/mms-tts-uig-script_arabic-UQSpeech'), pr_revision=None, pr_num=None)

## Finetune

In [None]:
# Define a finetune_uyghur.json as follows and save the file to your working directory
# {
#     "project_name": "mms-tts-uig-script_arabic-finetune",
#     "push_to_hub": true,
#     "hub_model_id": "ixxan/mms-tts-uig-script_arabic-UQSpeech",
#     "overwrite_output_dir": true,
#     "output_dir": "./finetuned_uig_model",

#     "dataset_name": "ixxan/mms-tts-uig-script_arabic-UQSpeech",
#     "audio_column_name": "path",
#     "text_column_name":"sentence",
#     "train_split_name": "train",
#     "eval_split_name": "train",

#     "full_generation_sample_text": "مەككىدە نازىل بولغان يەتتە ئايەت",

#     "max_duration_in_seconds": 20,
#     "min_duration_in_seconds": 1.0,
#     "max_tokens_length": 500,

#     "model_name_or_path": "./finetune-hf-vits/mms-uig-script_arabic-checkpoint",

#     "preprocessing_num_workers": 4,

#     "do_train": true,
#     "num_train_epochs": 5,
#     "gradient_accumulation_steps": 1,
#     "gradient_checkpointing": false,
#     "per_device_train_batch_size": 16,
#     "learning_rate": 2e-5,
#     "adam_beta1": 0.8,
#     "adam_beta2": 0.99,
#     "warmup_ratio": 0.01,
#     "group_by_length": false,

#     "do_eval": true,
#     "eval_steps": 1000,
#     "per_device_eval_batch_size": 16,
#     "max_eval_samples": 25,
#     "do_step_schedule_per_epoch": true,

#     "weight_disc": 3,
#     "weight_fmaps": 1,
#     "weight_gen": 1,
#     "weight_kl": 1.5,
#     "weight_duration": 1,
#     "weight_mel": 35,

#     "fp16": true,
#     "seed": 456
# }
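
Instead of creating the file by hand, the same configuration can be written from a notebook cell. This is a sketch, not part of the fine-tuning repository: the values are copied from the commented JSON above, and `FINETUNE_CONFIG` and `write_config` are hypothetical names introduced here.

```python
import json
from pathlib import Path

# Values copied from the commented JSON above; adjust them for your own run.
FINETUNE_CONFIG = {
    "project_name": "mms-tts-uig-script_arabic-finetune",
    "push_to_hub": True,
    "hub_model_id": "ixxan/mms-tts-uig-script_arabic-UQSpeech",
    "overwrite_output_dir": True,
    "output_dir": "./finetuned_uig_model",
    "dataset_name": "ixxan/mms-tts-uig-script_arabic-UQSpeech",
    "audio_column_name": "path",
    "text_column_name": "sentence",
    "train_split_name": "train",
    "eval_split_name": "train",
    "full_generation_sample_text": "مەككىدە نازىل بولغان يەتتە ئايەت",
    "max_duration_in_seconds": 20,
    "min_duration_in_seconds": 1.0,
    "max_tokens_length": 500,
    "model_name_or_path": "./finetune-hf-vits/mms-uig-script_arabic-checkpoint",
    "preprocessing_num_workers": 4,
    "do_train": True,
    "num_train_epochs": 5,
    "gradient_accumulation_steps": 1,
    "gradient_checkpointing": False,
    "per_device_train_batch_size": 16,
    "learning_rate": 2e-5,
    "adam_beta1": 0.8,
    "adam_beta2": 0.99,
    "warmup_ratio": 0.01,
    "group_by_length": False,
    "do_eval": True,
    "eval_steps": 1000,
    "per_device_eval_batch_size": 16,
    "max_eval_samples": 25,
    "do_step_schedule_per_epoch": True,
    "weight_disc": 3,
    "weight_fmaps": 1,
    "weight_gen": 1,
    "weight_kl": 1.5,
    "weight_duration": 1,
    "weight_mel": 35,
    "fp16": True,
    "seed": 456,
}


def write_config(path="finetune_uyghur.json", config=None):
    """Write the config as UTF-8 JSON and return the resulting Path."""
    config = FINETUNE_CONFIG if config is None else config
    path = Path(path)
    # ensure_ascii=False keeps the Uyghur sample text readable in the file.
    path.write_text(json.dumps(config, ensure_ascii=False, indent=4), encoding="utf-8")
    return path
```

Calling `write_config()` with no arguments writes `finetune_uyghur.json` to the current working directory, which is the path passed to `run_vits_finetuning.py` in the next cell.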

In [None]:
!accelerate launch finetune-hf-vits/run_vits_finetuning.py ./finetune_uyghur.json

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `1`
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
2024-12-25 01:51:40.165791: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-25 01:51:40.185450: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-25 01:51:40.191407: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-25 01:51:40.206396: I tensorflow/core/platform/cpu_feature_guard.cc:210] This 

In [None]:
from transformers import pipeline
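import scipy.io.wavfile  # import the submodule explicitly; on older SciPy versions a bare `import scipy` may not expose scipy.io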
from IPython.display import Audio

model_id = "ixxan/mms-tts-uig-script_arabic-UQSpeech"
synthesiser = pipeline("text-to-speech", model_id, device=0) # device=0 selects the first GPU; omit it to run on CPU

speech = synthesiser("ياق")

scipy.io.wavfile.write("finetuned_output.wav", rate=speech["sampling_rate"], data=speech["audio"][0])
Audio(speech["audio"][0], rate=speech["sampling_rate"])  # use the rate reported by the pipeline instead of hard-coding it

Device set to use cuda:0
