Request for PR review: Add support for Thai language #120

Open · jadechip opened this issue Apr 30, 2024 · 63 comments
@jadechip

I have created the following PR to add support for the Thai language.
I am in the process of creating a dataset to train the model but would love a PR review of the code first to make sure I am on the right track.

Thank you!

#117

@tchayintr

tchayintr commented Apr 30, 2024

Great job! I planned to train a Thai TTS model using MeloTTS too.

@Zengyi-Qin
Contributor

Hi, thanks for the contribution. We would suggest you first train on the Thai dataset to see whether the code works; we haven't had any attempt to train on Thai yet.

@jadechip
Author

jadechip commented May 1, 2024

@Zengyi-Qin Sounds good, will report back once I have proper training results.

@jadechip
Author

jadechip commented May 1, 2024

Thank you @tchayintr, if you have any recommendations for Thai audio datasets, I would greatly appreciate it!

@tchayintr

@jadechip
Sure! There are several datasets such as TSync2, Lotus, etc. You can check several of them here: https://github.com/korakot/corpus/releases/tag/v1.0 with documentation at https://lexitron.nectec.or.th/KM_HL5001/file_HL5001/Document/krrn_14518.pdf.

There are also Thai dialects available at https://github.com/SLSCU/thai-dialect-corpus.

However, I recommend collecting clear voice clips and crafting their transcriptions with ASR tools like WhisperX. This way, you can generate a lot of samples, but you may need to fine-tune it for the Thai language 😄.

I am reviewing your commits too. They mostly look great 🎆, but I found some points that need clarification.
I will look into them and let you know if anything needs to be adjusted in terms of Thai linguistic knowledge.

@jadechip
Author

jadechip commented May 1, 2024

@tchayintr this is super helpful and any feedback you have for my code will be greatly appreciated 🙏
I was also looking at this other nectec dataset: https://github.com/vistec-AI/dataset-releases/releases/tag/v1
I'll work on creating transcriptions next and report back.

@jadechip
Author

jadechip commented May 7, 2024

@Zengyi-Qin are there any additional steps or files needed before training? I am getting the following error:

output

⚡ add-thai ~/MeloTTS/melo torchrun --nproc_per_node=1 --master_port=10902 train.py --c data/thai/config.json --model thai
2024-05-07 15:24:58.152 | INFO     | data_utils:_filter:64 - Init dataset...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 141910/141910 [00:04<00:00, 32864.77it/s]
2024-05-07 15:25:02.475 | INFO     | data_utils:_filter:84 - min: 65; max: 987
2024-05-07 15:25:02.475 | INFO     | data_utils:_filter:85 - skipped: 327, total: 141910
buckets: [92994, 31326, 11604, 4350, 1068, 156, 84, 24]
/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/dataloader.py:554: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
2024-05-07 15:25:02.699 | INFO     | data_utils:_filter:64 - Init dataset...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 32832.13it/s]
2024-05-07 15:25:02.700 | INFO     | data_utils:_filter:84 - min: 164; max: 625
2024-05-07 15:25:02.700 | INFO     | data_utils:_filter:85 - skipped: 0, total: 4
Using noise scaled MAS for VITS2
Using duration discriminator for VITS2
(torch.Size([219, 192]), torch.Size([360, 192]))
(torch.Size([16, 192]), torch.Size([17, 192]))
(torch.Size([10, 192]), torch.Size([9, 192]))
(torch.Size([256, 256]), torch.Size([1, 256]))
list index out of range
  0%|                                                                                                                                                        | 0/23601 [00:01<?, ?it/s]
Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/teamspace/studios/this_studio/MeloTTS/melo/data_utils.py", line 194, in __getitem__
    return self.get_audio_text_speaker_pair(self.audiopaths_sid_text[index])
  File "/teamspace/studios/this_studio/MeloTTS/melo/data_utils.py", line 98, in get_audio_text_speaker_pair
    bert, ja_bert, phones, tone, language = self.get_text(
  File "/teamspace/studios/this_studio/MeloTTS/melo/data_utils.py", line 180, in get_text
    raise
RuntimeError: No active exception to reraise

...it seems to happen around line 200 in train.py

config.json

{
  "train": {
    "log_interval": 200,
    "eval_interval": 1000,
    "seed": 52,
    "epochs": 10000,
    "learning_rate": 0.0003,
    "betas": [
      0.8,
      0.99
    ],
    "eps": 1e-09,
    "batch_size": 6,
    "fp16_run": false,
    "lr_decay": 0.999875,
    "segment_size": 16384,
    "init_lr_ratio": 1,
    "warmup_epochs": 0,
    "c_mel": 45,
    "c_kl": 1.0,
    "skip_optimizer": true
  },
  "data": {
    "training_files": "data/thai/train.list",
    "validation_files": "data/thai/val.list",
    "max_wav_value": 32768.0,
    "sampling_rate": 44100,
    "filter_length": 2048,
    "hop_length": 512,
    "win_length": 2048,
    "n_mel_channels": 128,
    "mel_fmin": 0.0,
    "mel_fmax": null,
    "add_blank": true,
    "n_speakers": 1,
    "cleaned_text": true,
    "spk2id": {
      "TH-default": 0
    }
  },
  "model": {
    "use_spk_conditioned_encoder": true,
    "use_noise_scaled_mas": true,
    "use_mel_posterior_encoder": false,
    "use_duration_discriminator": true,
    "inter_channels": 192,
    "hidden_channels": 192,
    "filter_channels": 768,
    "n_heads": 2,
    "n_layers": 6,
    "n_layers_trans_flow": 3,
    "kernel_size": 3,
    "p_dropout": 0.1,
    "resblock": "1",
    "resblock_kernel_sizes": [
      3,
      7,
      11
    ],
    "resblock_dilation_sizes": [
      [
        1,
        3,
        5
      ],
      [
        1,
        3,
        5
      ],
      [
        1,
        3,
        5
      ]
    ],
    "upsample_rates": [
      8,
      8,
      2,
      2,
      2
    ],
    "upsample_initial_channel": 512,
    "upsample_kernel_sizes": [
      16,
      16,
      8,
      2,
      2
    ],
    "n_layers_q": 3,
    "use_spectral_norm": false,
    "gin_channels": 256
  },
  "num_languages": 9,
  "num_tones": 17,
  "symbols": [
    "_",
    "\"",
    "(",
    ")",
    "*",
    "/",
    ":",
    "AA",
    "E",
    "EE",
    "En",
    "N",
    "OO",
    "Q",
    "V",
    "[",
    "\\",
    "]",
    "^",
    "a",
    "a:",
    "aa",
    "ae",
    "ah",
    "ai",
    "an",
    "ang",
    "ao",
    "aw",
    "ay",
    "b",
    "by",
    "c",
    "ch",
    "d",
    "dh",
    "dy",
    "e",
    "e:",
    "eh",
    "ei",
    "en",
    "eng",
    "er",
    "ey",
    "f",
    "g",
    "gy",
    "h",
    "hh",
    "hy",
    "i",
    "i0",
    "i:",
    "ia",
    "ian",
    "iang",
    "iao",
    "ie",
    "ih",
    "in",
    "ing",
    "iong",
    "ir",
    "iu",
    "iy",
    "j",
    "jh",
    "k",
    "ky",
    "l",
    "m",
    "my",
    "n",
    "ng",
    "ny",
    "o",
    "o:",
    "ong",
    "ou",
    "ow",
    "oy",
    "p",
    "py",
    "q",
    "r",
    "ry",
    "s",
    "sh",
    "t",
    "th",
    "ts",
    "ty",
    "u",
    "u:",
    "ua",
    "uai",
    "uan",
    "uang",
    "uh",
    "ui",
    "un",
    "uo",
    "uw",
    "v",
    "van",
    "ve",
    "vn",
    "w",
    "x",
    "y",
    "z",
    "zh",
    "zy",
    "~",
    "æ",
    "ç",
    "ð",
    "ø",
    "ŋ",
    "œ",
    "ɐ",
    "ɑ",
    "ɒ",
    "ɔ",
    "ɕ",
    "ə",
    "ɛ",
    "ɜ",
    "ɡ",
    "ɣ",
    "ɥ",
    "ɦ",
    "ɪ",
    "ɫ",
    "ɬ",
    "ɭ",
    "ɯ",
    "ɲ",
    "ɵ",
    "ɸ",
    "ɹ",
    "ɾ",
    "ʁ",
    "ʃ",
    "ʊ",
    "ʌ",
    "ʎ",
    "ʏ",
    "ʑ",
    "ʒ",
    "ʝ",
    "ʲ",
    "ˈ",
    "ˌ",
    "ː",
    "̃",
    "̩",
    "β",
    "θ",
    "ก",
    "ข",
    "ฃ",
    "ค",
    "ฅ",
    "ฆ",
    "ง",
    "จ",
    "ฉ",
    "ช",
    "ซ",
    "ฌ",
    "ญ",
    "ฎ",
    "ฏ",
    "ฐ",
    "ฑ",
    "ฒ",
    "ณ",
    "ด",
    "ต",
    "ถ",
    "ท",
    "ธ",
    "น",
    "บ",
    "ป",
    "ผ",
    "ฝ",
    "พ",
    "ฟ",
    "ภ",
    "ม",
    "ย",
    "ร",
    "ล",
    "ว",
    "ศ",
    "ษ",
    "ส",
    "ห",
    "ฬ",
    "อ",
    "ฮ",
    "ะ",
    "ั",
    "า",
    "ำ",
    "ิ",
    "ี",
    "ึ",
    "ื",
    "ุ",
    "ู",
    "เ",
    "แ",
    "โ",
    "ใ",
    "ไ",
    "ๅ",
    "็",
    "่",
    "้",
    "์",
    "๐",
    "๑",
    "๒",
    "๓",
    "๔",
    "๕",
    "๖",
    "๗",
    "๘",
    "๙",
    "ᄀ",
    "ᄁ",
    "ᄂ",
    "ᄃ",
    "ᄄ",
    "ᄅ",
    "ᄆ",
    "ᄇ",
    "ᄈ",
    "ᄉ",
    "ᄊ",
    "ᄋ",
    "ᄌ",
    "ᄍ",
    "ᄎ",
    "ᄏ",
    "ᄐ",
    "ᄑ",
    "ᄒ",
    "ᅡ",
    "ᅢ",
    "ᅣ",
    "ᅤ",
    "ᅥ",
    "ᅦ",
    "ᅧ",
    "ᅨ",
    "ᅩ",
    "ᅪ",
    "ᅫ",
    "ᅬ",
    "ᅭ",
    "ᅮ",
    "ᅯ",
    "ᅰ",
    "ᅱ",
    "ᅲ",
    "ᅳ",
    "ᅴ",
    "ᅵ",
    "ᆨ",
    "ᆫ",
    "ᆮ",
    "ᆯ",
    "ᆷ",
    "ᆸ",
    "ᆼ",
    "ㄸ",
    "!",
    "?",
    "…",
    ",",
    ".",
    "'",
    "-",
    "¿",
    "¡",
    "SP",
    "UNK"
  ]
}

@jadechip
Author

jadechip commented May 8, 2024

Never mind, I was able to pinpoint the issue: I didn't realize you needed to add the language code here as well:
[screenshot]

I've updated my PR with the missing code.
It seems like it is training correctly now, although I am still getting some warnings/exceptions:

Using noise scaled MAS for VITS2
Using duration discriminator for VITS2
(torch.Size([219, 192]), torch.Size([360, 192]))
(torch.Size([16, 192]), torch.Size([17, 192]))
(torch.Size([10, 192]), torch.Size([9, 192]))
(torch.Size([256, 256]), torch.Size([1, 256]))
list index out of range
  0%|                                                                                                                                                        | 0/23601 [00:00<?, ?it/s][W reducer.cpp:1298] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/autograd/__init__.py:197: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.
grad.sizes() = [1, 9, 96], strides() = [99168, 96, 1]
bucket_view.sizes() = [1, 9, 96], strides() = [864, 96, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:325.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Evaluating ...
Evauate done
  0%|▍                                                                                                                                           | 74/23601 [03:24<11:00:36,  1.68s/it]min value is  tensor(-1.1265)

Will try to run the complete training loop on some H100s 🤞

@acul3

acul3 commented May 8, 2024

hello @jadechip
let me know if it's working.

I'm training for Indonesian and Malay, changing the phonemes and BERT as well.

After 10 epochs the model doesn't produce any intelligible words, only some noise and random vowels.

My data:
~200 hours
~500 speakers

@jeremy110

hello @jadechip @acul3

It seems there might be an issue with the training process. According to the current code, if your symbol size is not equal to the original 219, a new size will be used to initialize the TextEncoder. This means that you are not utilizing the base model, but rather retraining it. Based on my previous tests, this could lead to strange results where the model fails to properly generate text.

Solution: Similar to adding a new vocabulary to BERT, you should modify the loading process of the model.
https://huggingface.co/transformers/v2.11.0/_modules/transformers/modeling_utils.html (_get_resized_embeddings function)

@jadechip
Author

jadechip commented May 9, 2024

hello @jadechip @acul3

It seems there might be an issue with the training process. According to the current code, if your symbol size is not equal to the original 219, a new size will be used to initialize the TextEncoder. This means that you are not utilizing the base model, but rather retraining it. Based on my previous tests, this could lead to strange results where the model fails to properly generate text.

Solution: Similar to adding a new vocabulary to BERT, you should modify the loading process of the model. https://huggingface.co/transformers/v2.11.0/_modules/transformers/modeling_utils.html (_get_resized_embeddings function)

Thank you @jeremy110.
If I understand correctly, in melo/models.py we should first initialize the TextEncoder with the original 219 symbols, in order to load the pretrained weights, like this:

# models.py
        self.enc_p = TextEncoder(
            219,  # Initialize with the original symbol size
            inter_channels,
            hidden_channels,
            filter_channels,
            n_heads,
            n_layers,
            kernel_size,
            p_dropout,
            gin_channels=self.enc_gin_channels,
            num_languages=num_languages,
            num_tones=num_tones,
        )

...then, right after, add a check for whether n_vocab (len(symbols)) is a different size, and if so update self.enc_p.emb with the resized embeddings?

if n_vocab != 219:
    old_embeddings = self.enc_p.emb
    new_num_tokens = n_vocab
    self.enc_p.emb = self.get_resized_embeddings(old_embeddings, new_num_tokens)

Does that look correct to you?
Note: I've updated my PR to reflect this.

@jeremy110

hello~ @jadechip

Yes, it looks fine as it is.

However, in symbols.py you'll need to make some modifications. If you place your new symbols inside the sorted list and then use the method above, some symbols may end up with weights that don't match up with the original model. So, I suggest you do it like this:

# combine all symbols
normal_symbols = sorted(set(zh_symbols + ja_symbols + en_symbols + kr_symbols + es_symbols + fr_symbols + de_symbols + ru_symbols))
symbols = [pad] + normal_symbols + pu_symbols + new_symbols # add new symbols here 

@jadechip
Author

jadechip commented May 9, 2024

I see, thank you for the heads up @jeremy110 🙏
I've updated my code to reflect your suggestion; now I have:

# combine all symbols
normal_symbols = sorted(set(zh_symbols + ja_symbols + en_symbols + kr_symbols + es_symbols + fr_symbols + de_symbols + ru_symbols))
symbols = [pad] + normal_symbols + pu_symbols + th_symbols
sil_phonemes_ids = [symbols.index(i) for i in pu_symbols]

# combine all tones
num_tones = num_zh_tones + num_ja_tones + num_en_tones + num_kr_tones + num_es_tones + num_fr_tones + num_de_tones + num_ru_tones + num_th_tones

# language maps
language_id_map = {"ZH": 0, "JP": 1, "EN": 2, "ZH_MIX_EN": 3, 'KR': 4, 'ES': 5, 'SP': 5, 'FR': 6, 'TH': 7}
num_languages = len(language_id_map.keys())

I'll try running a new training job to evaluate performance with these changes.

@acul3

acul3 commented May 9, 2024

thanks @jadechip and @jeremy110

I'll try it in my environment too and see if it works.

@jadechip
Author

Ok, I was able to run a training job for around 9k steps yesterday. I tried running inference using the new checkpoint, but it seems to produce unintelligible sounds. I think the learning rate looks OK though, so I will try ramping up the batch size and training for longer on multiple GPUs, and report back with my results 🤞
For reference here is my current config and Tensorboard metrics.

{
  "train": {
    "log_interval": 200,
    "eval_interval": 1000,
    "seed": 52,
    "epochs": 10000,
    "learning_rate": 0.0003,
    "betas": [
      0.8,
      0.99
    ],
    "eps": 1e-09,
    "batch_size": 16,
    "fp16_run": false,
    "lr_decay": 0.999875,
    "segment_size": 16384,
    "init_lr_ratio": 1,
    "warmup_epochs": 0,
    "c_mel": 45,
    "c_kl": 1.0,
    "skip_optimizer": true
  },
  "data": {
    "training_files": "../Data/locutor/train.list",
    "validation_files": "../Data/locutor/val.list",
    "max_wav_value": 32768.0,
    "sampling_rate": 44100,
    "filter_length": 2048,
    "hop_length": 512,
    "win_length": 2048,
    "n_mel_channels": 128,
    "mel_fmin": 0.0,
    "mel_fmax": null,
    "add_blank": true,
    "n_speakers": 1,
    "cleaned_text": true,
    "spk2id": {
      "locutor": 0
    }
  },
  "model": {
    "use_spk_conditioned_encoder": true,
    "use_noise_scaled_mas": true,
    "use_mel_posterior_encoder": false,
    "use_duration_discriminator": true,
    "inter_channels": 192,
    "hidden_channels": 192,
    "filter_channels": 768,
    "n_heads": 2,
    "n_layers": 6,
    "n_layers_trans_flow": 3,
    "kernel_size": 3,
    "p_dropout": 0.1,
    "resblock": "1",
    "resblock_kernel_sizes": [
      3,
      7,
      11
    ],
    "resblock_dilation_sizes": [
      [
        1,
        3,
        5
      ],
      [
        1,
        3,
        5
      ],
      [
        1,
        3,
        5
      ]
    ],
    "upsample_rates": [
      8,
      8,
      2,
      2,
      2
    ],
    "upsample_initial_channel": 512,
    "upsample_kernel_sizes": [
      16,
      16,
      8,
      2,
      2
    ],
    "n_layers_q": 3,
    "use_spectral_norm": false,
    "gin_channels": 256
  },
  "num_languages": 1,
  "num_tones": 16,
  "symbols": [
...
[Tensorboard screenshots]

@jadechip
Author

jadechip commented May 11, 2024

btw I am currently training on a subset of Thai Common Voice 13, converted to .wav with a sample rate of 48 kHz.
Edit: Happy weekend everyone 🎉

@jeremy110

hello~ @jadechip

My config is basically the same as yours, except my batch size is 6. Perhaps you can increase your learning rate to 9e-4 and see how it performs. Also, I've added a constraint to the clip_grad_value in the code.

grad_norm_d = commons.clip_grad_value_(net_d.parameters(), 200)
grad_norm_g = commons.clip_grad_value_(net_g.parameters(), 500)

Finally, I'm attaching my tensorboard for reference. (https://drive.google.com/drive/folders/1xPNURmWsmJqwEDHVM8ZsK6CAbuv65ipI?usp=sharing)

Additionally, if the silence before and after your audio files is shorter, your g/dur will converge to a smaller value, which will also affect the length of the silence before and after the inference.

I'm not sure if the Thai CommonVoice 13 dataset is suitable for training. Also, there's no need to specifically convert it to 48kHz. I remember that the code will resample it. I think you can start by testing whether it can be trained with 10 hours of data from one person.

I hope this is helpful for you.

@jadechip
Author

Thank you for sharing! Your advice has been super helpful, @jeremy110 🙏

@jadechip
Author

Hmm, I trained for longer with different hyperparameters, but so far the results are not much better; something might be wrong with my code.

@acul3

acul3 commented May 15, 2024

Yeah, me too.

With longer training the voice is clearer and more similar, but it can't pronounce a single word.

Maybe a phonemizer problem, I don't know.

@jeremy110

jeremy110 commented May 15, 2024

hello @jadechip @acul3
I'd like to confirm something. Are all your tones set to 0?
Because I made a similar mistake before where I treated tones like ˧ ˦ as phones, but they should correspond to tones. Here's an example of what I did before.

#error
phones: ['_', 'k', 'e', 'ʔ', '˧', 'p', 'i', 'a', 'ʔ', '˧', 'ʦ', 'ʰ', 'i', 'n', '˦', '˦', 'k', 'e', '˦', '˦', ',', 'l', 'e', '˥', '˧', 's', 'ɔ', '˨', '˩', 'g', 'u', 'a', 'n', '˩', '˧', 'ʦ', 'a', 'i', '˧', '˧', '.', '_']
tones: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
word2ph: [1, 4, 5, 6, 4, 1, 4, 4, 6, 5, 1, 1]
#correct
phones: 28 ['_', 'k', 'e', 'ʔ', 'p', 'i', 'a', 'ʔ', 'ʦ', 'ʰ', 'i', 'n', 'k', 'e', ',', 'l', 'e', 's', 'ɔ', 'g', 'u', 'a', 'n', 'ʦ', 'a', 'i', '.', '_']
tones: [0, 4, 4, 4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 0, 2, 2, 3, 3, 5, 5, 5, 5, 7, 7, 7, 0, 0]
word2ph: [1, 3, 4, 4, 2, 1, 2, 2, 4, 3, 1, 1]

@acul3

acul3 commented May 15, 2024

@jeremy110 yes, all my tones are set to 0.

Now I'm wondering how I can fix this.

@jeremy110

hello~ @acul3 @jadechip
Sorry, I spent some time looking at that, but since I can't read Thai, I did some online research. I wanted to ask about the symbols from lines 266 to 339 in th_symbols: are those symbols not IPA?

Also, I looked at the Wiktionary file and found several symbols that seem to represent tones: ˧, ˨˩, ˦˥, ˩˦, and ˥˩. It looks like there are five tones. So, you need to convert these symbols into tones and then add the corresponding number of tones to the 'tones' list based on the number of phones in your phone list.

But I'm confused about lines 5908 to 5910. Which one is correct?

@jadechip
Author

@jeremy110 you are absolutely right. My code was outputting zeroes for the tones list.
I've pushed some changes to the g2p function which hopefully addresses this:

def g2p(norm_text):
    tokenized = tokenizer.tokenize(norm_text)
    phs = []
    word2ph = []
    current_word = []
    current_phonemes = []

    for token in tokenized:
        if token.startswith("▁"):  # Start of a new word
            if current_word:
                word_phonemes = " ".join(current_phonemes)
                phs.extend(word_phonemes.split())
                word2ph.append(len(current_phonemes))
                current_word = []
                current_phonemes = []
            current_word.append(token.replace("▁", ""))
        else:
            current_word.append(token)

        if token in punctuation or token in pu_symbols:
            phs.append(token)
            word2ph.append(1)
        else:
            phonemes = thai_text_to_phonemes(token.replace("▁", ""))
            current_phonemes.extend(phonemes.split())

    if current_word:
        word_phonemes = " ".join(current_phonemes)
        phs.extend(word_phonemes.split())
        word2ph.append(len(current_phonemes))

    # Distribute phonemes to match the number of tokens
    distributed_word2ph = []
    for i, group in enumerate(tokenized):
        if group.startswith("▁"):
            group = group.replace("▁", "")
        if group in punctuation or group in pu_symbols:
            distributed_word2ph.append(1)
        else:
            phonemes = thai_text_to_phonemes(group)
            distributed_word2ph.append(len(phonemes.split()))

    tone_markers = ['˥', '˦', '˧', '˨', '˩']
    phones = ["_"] + [re.sub(f'[{"".join(tone_markers)}]', '', p) for p in phs] + ["_"]  # Remove tone markers from phones
    tones = extract_tones(phs)  # Extract tones from the original phs list
    word2ph = [1] + distributed_word2ph + [1]

    assert len(word2ph) == len(tokenized) + 2

    return phones, tones, word2ph


def extract_tones(phones):
    tones = []
    tone_map = {
        "˥": 5,  # High tone
        "˦": 4,  # Rising tone
        "˧": 3,  # Mid tone
        "˨": 2,  # Falling tone
        "˩": 1,  # Low tone
    }

    for phone in phones:
        tone = 0
        for marker, value in tone_map.items():
            if marker in phone:
                tone = value
                break
        tones.append(tone)

    return tones

TL;DR:

  • It now removes the tone markers from the phonemes in phs using a regular expression and stores the result in the phones list, adding start and end markers ("_").
  • It then extracts the tones from the original phs list using the extract_tones function and stores them in the tones list.
  • It constructs the final word2ph list by adding start and end markers (1) to the distributed_word2ph list, and finally returns the phones, tones, and word2ph lists.

...I've also updated the following test case:

def test_g2p():
    text = "ฉันรักเมืองไทย"
    normalized_text = text_normalize(text)
    phones, tones, word2ph = g2p(normalized_text)
    assert phones == ['_', 't͡ɕʰ', 'a', 'n', '', 'r', 'a', 'k̚', '', 'm', 'ɯa̯', 'ŋ', '', 'tʰ', 'aj', '', '.', 'j', 'a', '', '.', '_']
    assert tones == [0, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 5, 0]
    assert word2ph == [1, 0, 8, 12, 1]

I think this output makes sense as the output is now similar to yours.

The phones list contains the phonemes corresponding to the input text, excluding the tone markers.
The mapping of tone markers to numeric values seems accurate (4 for ˩˩˦, 5 for ˦˥, 3 for ˧).

The word2ph list represents the number of phonemes for each word in the tokenized input. The values correspond to the number of phonemes for each word:

1: Start-of-sequence token
0: No phonemes for the first token (likely punctuation or special symbol)
8: Number of phonemes for the second token ("ฉันรัก")
12: Number of phonemes for the third token ("เมืองไทย")
1: End-of-sequence token

@jadechip
Author

About the Thai symbols: the characters from lines 266 to 339 are the characters of the Thai alphabet, including numerals.
The remaining lines (340-406) are characters that I copied from the Wiktionary file (which I got from https://github.com/PyThaiNLP/thai-g2p-wiktionary-corpus/tree/main). I am not sure if I should include them in this file (symbols.py), but if I remember correctly I was getting an error if I didn't include them.

@jadechip
Author

About lines 5908 to 5910 in the Wiktionary file, that is a good question. I am not sure which one is correct to be honest 🤔
Maybe I should try looking for a different Grapheme to Phoneme dictionary...

@tchayintr

About lines 5908 to 5910 in the Wiktionary file, that is a good question. I am not sure which one is correct to be honest 🤔 Maybe I should try looking for a different Grapheme to Phoneme dictionary...

I appreciate your hard work. 🥇

One of my concerns is that most Thai G2P tools are either rule-based or seq2seq, and their phoneme formats vary (e.g., Haas, IPA, etc.). I mention this in case you missed some of them. 😄

While rule-based tools offer more precise conversions, they may not always provide results for some graphemes. Seq2seq tools, on the other hand, offer more flexible conversions, but their CER or PER is still considered high, IMO.
Of course, these factors can reduce the smoothness in TTS.

I am concerned about the current state of Thai G2P and am trying to survey how we can address the challenges with Thai G2P.

@jeremy110

jeremy110 commented May 16, 2024

text:
ipa: l e ˥ ˧   s ɔ ˨˩
phones: ['_', 'l', 'e', 's', 'ɔ', '_']
tones: [0, 2, 2, 3, 3, 0]
word2ph: [1, 2, 2, 1]

Perhaps I misled you a bit. Let me clarify using an example.
For '˥ ˧' in my case, it corresponds to 2. Then, with two phones, 'l' and 'e', the tones correspond to two 2s.
For '˨˩' in my case, it corresponds to 3. Then, with two phones, 's' and 'ɔ', the tones correspond to two 3s.

@jeremy110

About lines 5908 to 5910 in the Wiktionary file, that is a good question. I am not sure which one is correct to be honest 🤔 Maybe I should try looking for a different Grapheme to Phoneme dictionary...

I appreciate your hard work. 🥇

One of my concerns is that most Thai G2P tools are either rule-based or seq2seq, and their phoneme formats vary (e.g., Haas, IPA, etc.). I mention this in case you missed some of them. 😄

While rule-based tools offer more precise conversions, they may not always provide results for some graphemes. Seq2seq tools, on the other hand, offer more flexible conversions, but their CER or PER is still considered high, IMO. Of course, these factors can reduce the smoothness in TTS.

I am concerned about the current state of Thai G2P and am trying to survey how we can address the challenges with Thai G2P.

Because I don't know Thai at all, I can't help with the g2p part. Sorry!

@tchayintr

Note that the phoneme column in the Wiktionary file includes some graphemes with two phonetic transcriptions separated by a comma. This can happen when a word has multiple accepted pronunciations or when the pronunciation can change based on context or regional accents.

For example, ไอดอล (ไอ=i, ดอล=dol) means "idol," where its phonemes include ʔaj˧.dɔl˥˩, ʔaj˧.dɔn˥˩ --> ʔaj˧.dɔl˥˩ (i-dol) or ʔaj˧.dɔn˥˩ (i-don).

@jadechip
Author

Thank you all for the insightful feedback.
I have pushed some changes and added another test case:

def test_g2p():
    # Test case for the word "กงล้อ"
    text = "กงล้อ"
    normalized_text = text_normalize(text)
    phones, tones, word2ph = g2p(normalized_text)

    # Expected output based on the wiktionary entry
    expected_phones = ['_', 'k', 'o', 'ŋ', 'l', 'ɔː', '_']
    expected_tones = [0, 2, 2, 2, 3, 3, 0]
    expected_word2ph = [1, 5, 1]

The TLDR is I now separate the syllables based on ".", then extract the tones and assign values based on this map:

tone_map = {
    "˧": 2,    # mid tone
    "˨˩": 1,   # low tone
    "˦˥": 3,   # high tone
    "˩˩˦": 4,  # rising tone
    "˥˩": 5,   # falling tone
}

For the tones list, it is calculated similarly to the method @jeremy110 described above; e.g. the word "กงล้อ" results in the following phonemes: ['k', 'o', 'ŋ', '˧', '.', 'l', 'ɔː', '˦˥'].
...as we can see, the first 3 phonemes 'k', 'o', 'ŋ' have an associated tone of '˧', which corresponds to 2,
and the remaining 2 phonemes 'l', 'ɔː' have a tone of '˦˥', which corresponds to 3,
therefore we get a list that looks like this: [0, 2, 2, 2, 3, 3, 0].
Note that the zeroes at the beginning and end represent the "_" special character.
Another note: if no tone marker is found in a group, a default tone value of 2 (mid tone) is assigned to all the phonemes in that group, i.e. these two have the same tone:

กง	k o ŋ
กง	k o ŋ ˧

As for the word2ph list, it represents the number of phonemes for each word in the input text, including special tokens.
So using our previous example, we get [1, 5, 1] where the ones are special characters and the 5 represents 'k', 'o', 'ŋ', 'l', 'ɔː'.
Note that the word2ph list has a length equal to the number of words plus 2 (for the special tokens).

@jadechip
Author

I should note, however, that there are surely edge cases, and I am not entirely sure the get_bert_feature function is correct. As always, any feedback and support is very much appreciated 🙏
Happy weekend everyone.

@jeremy110

Happy weekend ~~@jadechip
There are still some things that need to be modified

# Expected output based on the wiktionary entry
    expected_phones = ['_', 'k', 'o', 'ŋ', 'l', 'ɔː', '_']
    expected_tones = [0, 2, 2, 2, 3, 3, 0]
    expected_word2ph = [1, 3, 2, 1] # modified --> 3 means 3 phones (k o ŋ), 2 means 2 phones (l ɔː)

@wannaphong

I have a Thai TTS dataset that is open source: https://huggingface.co/datasets/lunarlist/edited_common_voice

@BankNatchapol

I have a Thai TTS dataset that is open source: https://huggingface.co/datasets/lunarlist/edited_common_voice

Any recommendation for fixing G2P? The pythainlp.transliterate.transliterate("สามารถ", engine="thaig2p") usually gives repeated phonemes.

@wannaphong

I have a Thai TTS dataset that is open source: https://huggingface.co/datasets/lunarlist/edited_common_voice

Any recommendation for fixing G2P? The pythainlp.transliterate.transliterate("สามารถ", engine="thaig2p") usually gives repeated phonemes.

Can you try https://huggingface.co/pythainlp/thaig2p-v2.0?

@tchayintr

tchayintr commented May 19, 2024

The issue has become a super long discussion! 😃

@jadechip

I should note however that there are surely edge cases and I am not entirely sure if the get_bert_feature function is correct. As always, any feedback and support is very much appreciated 🙏 Happy weekend everyone.

I did a quick check and test on your commits (jadechip@ffd8f41#diff-8f6f83dc5d5f83888cfad03f6835561fd38ec675fd0b5a07b3911ed38d786487).

I hope there was no mistake from my environment.

I found that it could not pass the assertion assert inputs["input_ids"].shape[-1] == len(word2ph), f"{inputs['input_ids'].shape[-1]}/{len(word2ph)}"

I followed chinese_mix.py and chinese_bert.py. It seems they mostly use character-based tokens for Chinese while using word-based tokens (as far as I know) for English, so there is no problem when extracting phonemes, because it aligns well with the tokenizer from hfl/chinese-roberta-wwm-ext-large.

For Thai, we need to be careful about the size of inputs and word2ph since each Thai BERT-based tokenizer can yield different tokens.

For example,

text = "กงล้อ"     

# Expected output based on the wiktionary entry
    expected_phones = ['_', 'k', 'o', 'ŋ', 'l', 'ɔː', '_']
    expected_tones = [0, 2, 2, 2, 3, 3, 0]
    expected_word2ph = [1, 5, 1]

The tokenizer should tokenize "กงล้อ" into 1 token (+2 special tokens).

On the other hand,

# Expected output based on the wiktionary entry
    expected_phones = ['_', 'k', 'o', 'ŋ', 'l', 'ɔː', '_']
    expected_tones = [0, 2, 2, 2, 3, 3, 0]
    expected_word2ph = [1, 3, 2, 1] # modified --> 3 mean 3 phones(koŋ), 2 mean 2 phones(l ɔ)

The tokenizer should tokenize "กงล้อ" into 2 tokens, กง and ล้อ (+2 special tokens)

Ultimately, I think if we handle it properly (e.g., using the tokenizer before G2P), the get_bert_feature would be fine.

Plus, I am trying to make a rule to combine initial, vowel, and final phones properly, like they did in chinese_mix.py. The result will look like:

text = "Today ฉันกินข้าว Good เป็นไง"
expected_phones = ['_', 't', 'ah', 'd', 'ey', 'ch', 'an', 'k', 'in', 'kh', 'aaw', 'g', 'uh', 'd', 'p', 'en', 'N', 'aj', '_']
expected_tones = [0, 7, 8, 7, 9, 5, 5, 1, 1, 3, 3, 7, 9, 7, 1, 1, 1, 1, 0]
rev_expected_tones = [0, 7, 8, 7, 9, 19, 19, 15, 15, 17, 17, 7, 9, 7, 15, 15, 15, 15, 0]  # language_tone_start_map['TH']
expected_word2ph = [1, 4, 2, 2, 2, 3, 2, 2, 1]

Just as an example, please omit the English tones here.

I still need to deal with the alignment of word2ph and tokens tokenized from BERT. If my method works, I will update.

Update:

I still need to deal with the alignment of word2ph and tokens tokenized from BERT. If my method works, I will update.

I convert each word (or a syllable) from text into a token id for encoding. However, this is not the best solution since an unseen word will become an UNK token id.
By the way, I use my Thai BERT tokenizer/encoder containing around 100k words and it has been pre-trained for Thai token classification. At least, it should lessen the unseen word issue a bit.

@jadechip
Author

jadechip commented May 19, 2024

Thank you all for your valuable feedback! It's great to see such active collaboration, showcasing the strength of the open source community 💪

Apologies @tchayintr, the assertion was indeed failing. I've pushed some code which should resolve the issue, and the output format should now be correct:

BERT tokens: ['▁', 'กง', 'ล้อ']
Aligning word: กงล้อ
Word phonemes: ['k', 'o', 'ŋ', 'l', 'ɔː']
Word tones: [2, 2, 2, 3, 3]
Final phs: ['_', 'k', 'o', 'ŋ', 'l', 'ɔː', '_']
Final tones: [1, 2, 2, 2, 3, 3, 1]
Final word2ph: [1, 3, 2, 1]

Thank you for highlighting your proposed approach; please feel free to run some tests or add changes to my code. I am still not 100% confident that everything is working perfectly, but I am eager to try another training job soon to see if the quality has improved 😄

@BankNatchapol do you mean we should replace the g2p function with another model?
I am not sure how much better https://huggingface.co/pythainlp/thaig2p-v2.0 would be as it is trained on the same dictionary file that I am using (wiktionary-23-7-2022-clean.tsv). Or just let me know if my understanding is incorrect.

Ah, never mind; rereading your reply, I realize you are talking about the pythainlp.tokenize word_tokenize(norm_text, engine="newmm") tokenizer.

@tiebay004

tiebay004 commented May 21, 2024

Test results of fine-tuning the Thai text-to-speech model
https://www.youtube.com/watch?v=7sApLg5l2Ps

@tchayintr

Test results of fine-tuning the Thai text-to-speech model https://www.youtube.com/watch?v=7sApLg5l2Ps

Thank you for the example, @tiebay004

May I know more details about the libraries or models you used? Coqui?

Honestly, sorry to mention it, but the result doesn't seem as smooth as other languages published on MeloTTS.
I wonder if it is related to MeloTTS? 😭

We are looking forward to the day when the smoothness of Thai TTS is comparable to major languages. However, I feel MeloTTS might have a chance.

@tiebay004

Since I had never heard of MeloTTS before, I chose to fine-tune with Coqui. I will try fine-tuning with MeloTTS, but the training takes a long time, and I am using my son's PC for training because he has a GPU with high VRAM.

Since it's the school holidays in Thailand, I should be able to experiment more. As I haven't worked in data before, I might not be able to answer many technical questions about the model. I used to be a developer several years ago.

@jadechip
Author

jadechip commented May 30, 2024

Hello everyone. I've been a bit busy lately, but I've pushed some changes that I think should fix a lot of the outstanding issues. However, I am now running into some CUDA issues trying to get the training code to run, which weren't happening before.

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I think this means the code is trying to access elements of a tensor outside its valid range, but I am not sure where exactly this is happening. Using pdb, I was able to narrow it down to the forward method of the TextEncoder class in models.py; specifically, the line self.language_emb(language) seems to trigger the error, but I am still not sure why, as language here is defined with a shape similar to the input x.

For added context, this is what my train.list file looks like right now.

/workspace/commonvoice/common_voice_th_25686299.wav|TH-default|TH|อนาคตของการทำงานคือมนุษย์หุ่นยนต์ที่เพิ่มขึ้น|_ ʔ a n aː kʰ o t̚ kʰ ɔː ŋ ก า ร ท ำ ง า น kʰ ɯː m a n u t̚ h u n j o n tʰ iː เ พ ิ ่ ม ข ึ ้ น _|1 1 1 2 2 3 3 3 4 4 4 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 1 1 1 2 2 2 5 5 2 2 2 2 2 2 2 2 2 1|1 10 8 2 5 6 2 9 1
/workspace/commonvoice/common_voice_th_26600610.wav|TH-default|TH|ผู้ประกอบอาชีพ|_ pʰ uː p r a k ɔː p̚ ʔ aː t͡ɕʰ iː p̚ _|1 5 5 1 1 1 1 1 1 2 2 5 5 5 1|1 8 5 1
/workspace/commonvoice/common_voice_th_25686917.wav|TH-default|TH|ฉันหวังว่าทักษะของคุณจะเกาขึ้น|_ t͡ɕʰ a n w a ŋ ว ่ า tʰ a k̚ s aʔ kʰ ɔː ŋ kʰ u n c aʔ k eː า kʰ ɯ n _|1 4 4 4 4 4 4 2 2 2 3 3 3 1 1 4 4 4 2 2 2 2 2 2 2 2 5 5 5 1|1 8 1 5 6 2 2 1 3 1
/workspace/commonvoice/common_voice_th_26665439.wav|TH-default|TH|อันหนึ่งอันเดียวกัน|_ ʔ a n n ɯ ŋ ʔ a n d ia̯w k a n _|1 2 2 2 1 1 1 2 2 2 2 2 2 2 2 1|1 6 3 5 1

@jadechip
Author

@jeremy110 should I update the number of tones in symbols.py to 5, since Thai has 5 tones? Currently it is num_th_tones = 1.
The reason I am asking is that I see most other languages have it set to 1, so I am not sure if it has any impact.

@jeremy110

Hello @jadechip
Yes, you should update your number of tones to 5.

There are some issues in your train.list:

  1. The _ at the beginning and end should correspond to tone 0.
  2. The tones and word2ph do not match up with our previous discussion.
    To illustrate with my example: three 8's, three 1's, three 2's, two 7's, one 5 ...
/path/audio.wav|F1|TAI|一杯走味的咖啡,|_ ʦ i t p u e ʦ a u b i e k a p i , _|0 8 8 8 1 1 1 2 2 2 7 7 5 1 1 1 1 0 0|1 3 3 3 2 1 2 2 1 1

@jadechip
Author

jadechip commented Jun 2, 2024

Thank you @jeremy110, can I just clarify:
I have switched the tones for the "_" characters to zeroes.
However, I am a bit confused regarding the format of the word2ph list. I should note that I have switched the tokenizer in the g2p method, so the word mapping might be a bit different; right now it uses the same tokenizer as the get_bert_feature function, which I believe adheres more closely to the implementation of other languages.
As an example, the phrase ใครเป็นผู้รับ would be tokenized into the following chunks: ['▁ใคร', 'เป็นผู้รับ'].
This results in a word2ph list that looks like this: [1, 3, 8, 1], where the ones are the underscore characters, the 3 = ใ ค ร, and the 8 = เ ป็ น ผ้ ู ร ั บ.
...and there are 13 tones (one assigned for each phoneme).

tokenized ['▁ใคร', 'เป็นผู้รับ']
Final phs: ['_', 'kʰ', 'r', 'aj', 'p', 'e', 'n', 'pʰ', 'uː', 'r', 'a', 'p̚', '_']
Final tones: [0, 2, 2, 2, 2, 2, 2, 5, 5, 3, 3, 3, 0]
Final word2ph: [1, 3, 8, 1]
len(phones) 13
len(tones) 13

Is this correct?

@jeremy110

Hello @jadechip
I think you can refer to the French section. Below is an example in French; you can see that word2ph is calculated by converting words into their IPA phones. In it, sə- corresponds to 3, sɛʁvˈis corresponds to 7, ɡʁatyˈi corresponds to 7, and ɛt corresponds to 2.

French: Ce service gratuit est disponible en chinois simplifié et autres 123.
ipa: - sɛʁvˈis ɡʁatyˈi ɛt disponˈibl ɑ̃n ʃinwˈa sɛ̃plifjˈe e otʁz sˈɑ̃ vˈɛ̃ tʁwˈa.

phones: ['_', 's', 'ə', '-', 's', 'ɛ', 'ʁ', 'v', 'ˈ', 'i', 's', 'ɡ', 'ʁ', 'a', 't', 'y', 'ˈ', 'i', 'ɛ', 't', 'd', 'i', 's', 'p', 'o', 'n', 'ˈ', 'i', 'b', 'l', 'ɑ', '̃', 'n', 'ʃ', 'i', 'n', 'w', 'ˈ', 'a', 's', 'ɛ', '̃', 'p', 'l', 'i', 'f', 'j', 'ˈ', 'e', 'e', 'o', 't', 'ʁ', 'z', 's', 'ˈ', 'ɑ', '̃', 'v', 'ˈ', 'ɛ', '̃', 't', 'ʁ', 'w', 'ˈ', 'a', '.', '_']
tones: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
word2ph: [1, 3, 7, 7, 2, 10, 3, 6, 5, 5, 1, 4, 13, 1, 1]

In your case it should be

Final phs: ['_', 'kʰ', 'r', 'aj', 'p', 'e', 'n', 'pʰ', 'uː', 'r', 'a', 'p̚', '_']
Final tones: [0, 2, 2, 2, 2, 2, 2, 5, 5, 3, 3, 3, 0]
Final word2ph: [1, 3, 3, 2, 3, 1]

@maryne-ii

maryne-ii commented Jun 13, 2024

@jadechip Hello there, how were your training results? Are you still struggling with the pronunciation?

@jadechip
Author

Hi @maryne-ii, I believe the pronunciation issues should be resolved; however, I am having some issues getting distributed training to work. This is the error I am getting:

terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at ../third_party/gloo/gloo/transport/tcp/pair.cc:446] op.preamble.length <= op.nbytes. 874668 vs 80644

My understanding is that the gloo library is used by PyTorch for collective communication in distributed training, so perhaps the error indicates some kind of mismatch between expected and actual sizes during TCP communication, but I am not sure what in my code (if anything) is causing this...

I should also note I have been using PyTorch > 2.x to train as I was getting other CUDA errors similar to #96.

Training on a single GPU seems to work though.

@jadechip
Author

I've had some time to continue working on this and was able to resolve the training issues.
I believe the inconsistencies were caused by the tokenizer I was using; I have now switched to a tokenizer that aligns more closely with the format expected by the codebase. The format is close to what @jeremy110 suggested, apart from the underscore characters in the output. I am not sure if I should remove the underscore characters in the tokenized text before calculating the phs, tones, and word2ph values; I am concerned it might cause inconsistencies with the get_bert_feature function later in the pipeline.

The tokenized text ['▁', 'กง', 'ล้อ']
Final phs: ['_', '▁', 'k', 'o', 'ŋ', 'l', 'ɔː', '_']
Final tones: [0, 2, 2, 2, 2, 3, 3, 0]
Final word2ph: [1, 1, 3, 2, 1]
bert features shape torch.Size([768, 8])

@jeremy110

I think it can be removed because it duplicates the original underscores.

@jadechip
Author

Ok, thank you, I will give this a shot.

@steven1565528

hello @jadechip
I'm doing something similar to you: I want to train Khmer and Lao, but this discussion (#120) is very long, and I keep getting errors in train.py. I'm planning to write a training tutorial for a small language; can you help me? I'd appreciate it.

@jadechip
Author

jadechip commented Aug 7, 2024

Thanks for your continued interest and support.
I found quite a few issues in the g2p lookup logic, which should now be resolved, and I have changed the tokenizer to clicknext/phayathaibert, which I believe is more aligned with the tokenizers of the other languages.

Here are some examples from val.list to illustrate what the processed data now looks like:

/workspace/commonvoice/common_voice_th_26125969.wav|TH-default|TH|บีจีที คอร์ปอเรชั่น|_ _ b ɔː iː t͡ɕ iː tʰ i _ kʰ ɔː p ɔː eː r ɔː t͡ɕʰ ɔː a n ɔː _|0 0 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 1 1 0|1 1 4 3 1 6 6 1
/workspace/commonvoice/common_voice_th_26843282.wav|TH-default|TH|'กลับมานะ !' เจ้าหนอนตะโกนบอกเธอ|_ ▁ ' k l a p̚ m aː n aʔ ▁ ! ' ▁ t͡ɕ aːw n ɔː n t a k oː n b ɔː k̚ tʰ ɤː _|0 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 5 5 5 4 4 2 2 2 2 2 2 1 1 0|1 4 4 3 3 3 3 3 3 3 1
/workspace/commonvoice/common_voice_th_25665937.wav|TH-default|TH|ผู้กล้าหาญที่สูงศักดิ์ สุภาพบุรุษจะคิดถึงตัวเองเป็นสิ่งสุดท้าย|_ ▁ pʰ uː k l aː h aː n tʰ iː s uː ŋ s a k̚ _ s u pʰ aː p̚ b u r u t̚ c aʔ kʰ i t̚ tʰ ɯ ŋ t ua̯ ʔ eː ŋ p e n s i ŋ s u t̚ tʰ aːj _|0 2 2 2 5 5 5 2 2 2 4 4 5 5 5 4 4 4 0 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 2 2 2 4 4 2 2 2 2 2 2 2 2 2 1 1 1 2 2 0|1 5 4 4 4 1 6 6 6 6 5 5 1
/workspace/commonvoice/common_voice_th_25640308.wav|TH-default|TH|เอด้าที่รักเอสเตอร์ที่รักของฉัน ยินดีต้อนรับ|_ ▁ ʔ eː d ɔː aː tʰ iː r a k̚ eː ʔ ɔː s ɔː eː t ɔː tʰ iː r a k̚ kʰ ɔː ŋ t͡ɕʰ a n _ j i n d iː t ɔː n r a p̚ _|0 2 2 2 2 2 3 3 3 2 2 2 3 3 3 2 2 4 4 4 2 2 2 2 2 3 3 3 4 4 4 0 2 2 2 2 2 2 2 2 2 2 2 0|1 5 5 4 4 4 4 4 1 6 5 1

As you can see, there are still some additional underscore characters ("▁") present, which are part of the tokenizer output and should represent the pauses between words, since Thai text does not use spaces. I tried several times to remove them, but they caused mismatches in the get_bert_feature function, so I decided to keep them for now.

Unfortunately, the training loss is still far from ideal and the model is not producing great results. I was able to get it down to around 55 by tweaking the learning rate, but I am not sure what is preventing it from going lower.

[Tensorboard screenshot]

@jadechip
Author

jadechip commented Aug 7, 2024

I have pushed all my changes so if anyone would like to give training a go, please let me know how it goes 🤞

@jadechip
Author

Here are the results for the different loss components of my latest training run, side by side with the results @jeremy110 posted. The ones on the left in red are mine. Based on my limited understanding, the g/fm loss is increasing, which seems to suggest the generator is struggling to match the discriminator's features.
[loss curves screenshot]

@jadechip
Author

I will try another run with a different learning rate decay and more aggressive gradient clipping. I might also need to normalize the audio.

@Ishank56

@Zengyi-Qin Sounds good, will report back once I have proper training results.

Hey, can you send me your contact info so that I can reach out to you regarding the implementation of this project for a new language? I am facing a lot of issues.
