Request for PR review: Add support for Thai language #120
Comments
Great job! I planned to work on training a Thai TTS model using MeloTTS too. |
Hi - Thanks for the contribution. We would suggest you first train on the Thai dataset to see if the code works. We haven't had any attempts to train on Thai yet. |
@Zengyi-Qin Sounds good, will report back once I have proper training results. |
Thank you @tchayintr, if you have any recommendations for Thai audio datasets, I would greatly appreciate it! |
@jadechip There are also Thai dialects available at https://github.com/SLSCU/thai-dialect-corpus. However, I recommend collecting clear voice clips and crafting their transcriptions with ASR tools like WhisperX. This way you can generate a lot of samples, but you may need to fine-tune the ASR for the Thai language 😄. I am reviewing your commits too. They mostly look great 🎆, but I found some points that need to be clarified. |
@tchayintr this is super helpful and any feedback you have for my code will be greatly appreciated 🙏 |
@Zengyi-Qin are there any additional steps or files needed before training? I am getting the following error (see the attached output); it seems to happen around line 200 in train.py (config.json also attached). |
Hello @jadechip, I'm training for Indonesian and Malay, changing the phonemes and BERT as well. After 10 epochs the model doesn't produce any intelligible words, only noise and some random vowels. My data is attached. |
It seems there might be an issue with the training process. According to the current code, if your symbol size is not equal to the original 219, a new size will be used to initialize the TextEncoder. This means that you are not utilizing the base model, but rather retraining it. Based on my previous tests, this could lead to strange results where the model fails to properly generate text. Solution: Similar to adding a new vocabulary to BERT, you should modify the loading process of the model. |
Thank you @jeremy110.
...then, right after, add a check for whether n_vocab (len(symbols)) has a different size, and if so update self.enc_p.embed_tokens with the resized embeddings?
Does that look correct to you? |
Hello~ @jadechip Yes, it looks fine as it is. However, you'll need to make some modifications in symbols.py. If you place your new symbols inside the sorted list and then use the method above, some symbols may end up with weights that don't match the original model. So I suggest you do it like this:

```python
# combine all symbols
normal_symbols = sorted(set(zh_symbols + ja_symbols + en_symbols + kr_symbols + es_symbols + fr_symbols + de_symbols + ru_symbols))
symbols = [pad] + normal_symbols + pu_symbols + new_symbols  # add new symbols here
```
|
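Putting the two suggestions together, a minimal sketch of what the modified checkpoint loading could look like. The module path net_g.enc_p.embed_tokens and the state-dict key follow the names used in this discussion and may differ from the actual MeloTTS code:

```python
import torch

def load_base_with_extra_symbols(net_g, ckpt_path, key="enc_p.embed_tokens.weight"):
    # Load the base checkpoint, trained with the original 219 symbols.
    state_dict = torch.load(ckpt_path, map_location="cpu")["model"]
    old_weight = state_dict[key]          # shape: (219, hidden)
    emb = net_g.enc_p.embed_tokens        # resized nn.Embedding(new_n_vocab, hidden)
    if emb.weight.shape[0] != old_weight.shape[0]:
        # Copy the pretrained rows; rows for the newly appended symbols
        # keep their fresh random initialization.
        with torch.no_grad():
            emb.weight[: old_weight.shape[0]] = old_weight
        # Drop the mismatched tensor so load_state_dict doesn't fail.
        del state_dict[key]
    net_g.load_state_dict(state_dict, strict=False)
```

This only works if the new symbols are appended after the original ones, which is exactly why the snippet above places new_symbols at the end of the list.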
I see, thank you for the heads up @jeremy110 🙏
I'll try running a new training job to evaluate performance with these changes. |
Thanks @jadechip and @jeremy110, I'll try it in my environment too and see if it works. |
btw I am currently training on a subset of Thai commonvoice 13, converted to .wav with a sample rate of 48 kHz. |
Hello~ @jadechip My config is basically the same as yours, except my batch size is 6. Perhaps you can increase your learning rate to 9e-4 and see how it performs. Also, I've added a constraint to the clip_grad_value in the code:

```python
grad_norm_d = commons.clip_grad_value_(net_d.parameters(), 200)
grad_norm_g = commons.clip_grad_value_(net_g.parameters(), 500)
```

Finally, I'm attaching my tensorboard for reference: https://drive.google.com/drive/folders/1xPNURmWsmJqwEDHVM8ZsK6CAbuv65ipI?usp=sharing

Additionally, if the silence before and after your audio files is shorter, your g/dur will converge to a smaller value, which will also affect the length of the silence before and after inference. I'm not sure if the Thai CommonVoice 13 dataset is suitable for training. Also, there's no need to specifically convert it to 48 kHz; I remember the code will resample it. I think you can start by testing whether the model can be trained with 10 hours of data from one speaker. I hope this is helpful for you. |
Thank you for sharing! Your advice has been super helpful @jeremy110 🙏 |
Hmm, I trained for longer with different hyperparameters, but so far the results are not much better; something might be wrong with my code. |
Yeah, same here: with longer training the voice is clearer and similar, but it can't pronounce a single word. Maybe a phonemizer problem, I don't know. |
Hello @jadechip @acul3

```
# error
phones: ['_', 'k', 'e', 'ʔ', '˧', 'p', 'i', 'a', 'ʔ', '˧', 'ʦ', 'ʰ', 'i', 'n', '˦', '˦', 'k', 'e', '˦', '˦', ',', 'l', 'e', '˥', '˧', 's', 'ɔ', '˨', '˩', 'g', 'u', 'a', 'n', '˩', '˧', 'ʦ', 'a', 'i', '˧', '˧', '.', '_']
tones: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
word2ph: [1, 4, 5, 6, 4, 1, 4, 4, 6, 5, 1, 1]

# correct
phones: 28 ['_', 'k', 'e', 'ʔ', 'p', 'i', 'a', 'ʔ', 'ʦ', 'ʰ', 'i', 'n', 'k', 'e', ',', 'l', 'e', 's', 'ɔ', 'g', 'u', 'a', 'n', 'ʦ', 'a', 'i', '.', '_']
tones: [0, 4, 4, 4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 0, 2, 2, 3, 3, 5, 5, 5, 5, 7, 7, 7, 0, 0]
word2ph: [1, 3, 4, 4, 2, 1, 2, 2, 4, 3, 1, 1]
```
|
@jeremy110 Yes, all my tones are set to 0. Now I'm wondering how I can fix this. |
Hello~ @acul3 @jadechip Also, I looked at the Wiktionary file and found several symbols that seem to represent tones: ˧, ˨˩, ˦˥, ˩˦, and ˥˩. It looks like there are five tones. So you need to convert these symbols into tone IDs, and then append each syllable's tone to the 'tones' list once per phone in that syllable. But I'm confused about lines 5908 to 5910: which one is correct? |
@jeremy110 you are absolutely right. My code was outputting zeroes for the tones list.
TLDR;
...I've also updated the following test case:
I think this output makes sense, as it is now similar to yours. The phones list contains the phonemes corresponding to the input text, excluding the tone markers. The word2ph list represents the number of phonemes for each word in the tokenized input; the values correspond to the number of phonemes for each word:
|
About the Thai symbols, the characters from line 266 to 339 are the characters of the Thai alphabet, including numbers. |
About lines 5908 to 5910 in the Wiktionary file, that is a good question. I am not sure which one is correct to be honest 🤔 |
I appreciate your hard work. 🥇 One of my concerns is that most Thai G2P tools are either rule-based or seq2seq, and their phoneme formats vary (e.g., Haas, IPA, etc.):
In case you missed some of them 😄. While rule-based tools offer more precise conversions, they may not always produce results for some graphemes. Seq2seq tools, on the other hand, offer more flexible conversions, but their CER or PER is still considered high, IMO. I am concerned about the current state of Thai G2P and am trying to survey how we can address its challenges. |
Perhaps I misled you a bit. Let me clarify using an example:

```
text: 禮 數
ipa: l e ˥ ˧ s ɔ ˨˩
phones: ['_', 'l', 'e', 's', 'ɔ', '_']
tones: [0, 2, 2, 3, 3, 0]
word2ph: [1, 2, 2, 1]
```
|
Because I don't know Thai at all, I can't help with the g2p part. |
Note that the phoneme column in the Wiktionary file includes some graphemes with two phonetic transcriptions separated by a comma. This can happen when a word has multiple accepted pronunciations or when the pronunciation can change based on context or regional accents. For example, ไอดอล (ไอ=i, ดอล=dol) means "idol," and its phoneme entry is "ʔaj˧.dɔl˥˩, ʔaj˧.dɔn˥˩", i.e., ʔaj˧.dɔl˥˩ (i-dol) or ʔaj˧.dɔn˥˩ (i-don). |
Thank you all for the insightful feedback.
The TL;DR is that I now separate the syllables based on ".", then extract the tones and assign values based on this map:
For the tones list, it is calculated similarly to the method @jeremy110 described above; e.g., the word "กงล้อ" results in the following phonemes: ['k', 'o', 'ŋ', '˧', '.', 'l', 'ɔː', '˦˥'].
As for the word2ph list, it represents the number of phonemes for each word in the input text, including special tokens. |
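For illustration, a minimal sketch of that syllable-splitting step, using the กงล้อ example above. The tone-ID values in TONE_MAP are placeholders; the actual map from the commit is not reproduced here:

```python
TONE_MAP = {"˧": 2, "˨˩": 1, "˦˥": 3, "˩˦": 4, "˥˩": 5}  # placeholder tone IDs

def split_syllables(phoneme_seq):
    """Split a G2P output on '.', peel the trailing tone mark off each
    syllable, and build the parallel phones/tones/word2ph lists."""
    phones, tones, word2ph = [], [], []
    syllable = []
    for ph in phoneme_seq + ["."]:        # sentinel flushes the last syllable
        if ph != ".":
            syllable.append(ph)
            continue
        if syllable and syllable[-1] in TONE_MAP:
            *body, mark = syllable
            tone = TONE_MAP[mark]
        else:                             # syllable without an explicit tone mark
            body, tone = syllable, 0
        phones += body
        tones += [tone] * len(body)       # one tone entry per phone in the syllable
        word2ph.append(len(body))
        syllable = []
    return phones, tones, word2ph

# split_syllables(['k', 'o', 'ŋ', '˧', '.', 'l', 'ɔː', '˦˥'])
# -> (['k', 'o', 'ŋ', 'l', 'ɔː'], [2, 2, 2, 3, 3], [3, 2])
```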
I should note however that there are surely edge cases and I am not entirely sure if the |
Happy weekend~ @jadechip

```python
# Expected output based on the Wiktionary entry
expected_phones = ['_', 'k', 'o', 'ŋ', 'l', 'ɔː', '_']
expected_tones = [0, 2, 2, 2, 3, 3, 0]
expected_word2ph = [1, 3, 2, 1]  # modified --> 3 means 3 phones (k o ŋ), 2 means 2 phones (l ɔː)
```
|
I have an open-source Thai TTS dataset: https://huggingface.co/datasets/lunarlist/edited_common_voice |
Any recommendations for fixing G2P? pythainlp.transliterate.transliterate("สามารถ", engine="thaig2p") usually gives repeated phonemes. |
Can you try https://huggingface.co/pythainlp/thaig2p-v2.0? |
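For reference, the call mentioned above can be reproduced like this (the exact output string is not shown in the thread):

```python
# The pythainlp call quoted above; engine="thaig2p" selects the seq2seq
# Thai G2P model, which is where the repeated phonemes can come from.
from pythainlp.transliterate import transliterate

ipa = transliterate("สามารถ", engine="thaig2p")
print(ipa)  # an IPA string with tone marks; exact output depends on the model version
```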
The issue has become a super long discussion! 😃
I did a quick check and test on your commits (jadechip@ffd8f41#diff-8f6f83dc5d5f83888cfad03f6835561fd38ec675fd0b5a07b3911ed38d786487). I hope there was no mistake from my environment. I found that it could not pass the assertion […]; I followed the […]. For Thai, we need to be careful about the size of word2ph.
For example, one tokenizer should tokenize "กงล้อ" into 1 token (+2 special tokens). On the other hand, another should tokenize "กงล้อ" into 2 tokens, กง and ล้อ (+2 special tokens).
Ultimately, I think if we handle it properly (e.g., running the tokenizer before G2P), the […]. Plus, I am trying to make a rule to combine initial, vowel, and final phones properly, like they did in […].
Just as an example, please omit the English tones here. I still need to deal with the alignment between word2ph and the tokens tokenized by BERT. If my method works, I will update.
Update: I convert each word (or a syllable) from […]. |
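A rough sketch of the word2ph/BERT alignment being described here, under the assumption that each entry of word2ph counts the phones belonging to one BERT token, special tokens included (the repo's get_bert_feature may differ in detail):

```python
import torch

def expand_bert_features(token_features, word2ph):
    # token_features: (n_tokens, hidden); word2ph holds one phone count per
    # token, so the two lengths must match or the assertion fails.
    assert token_features.shape[0] == len(word2ph)
    counts = torch.tensor(word2ph)
    # Repeat each token's feature vector once per phone of that token,
    # yielding a (sum(word2ph), hidden) matrix aligned with the phone list.
    return torch.repeat_interleave(token_features, counts, dim=0)
```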
Thank you all for your valuable feedback! It's great to see such active collaboration, showcasing the strength of the open source community 💪 Apologies @tchayintr, the assertion was indeed failing. I've pushed some code which should resolve the issue, and the output format should now be correct:
Thank you for highlighting your proposed approach; please feel free to run some tests or add changes to my code. I am still not 100% confident that everything is working perfectly, but I am eager to try another training job soon to see if the quality has improved 😄 @BankNatchapol |
Test results of fine-tuning the Thai text-to-speech model |
Thank you for the example, @tiebay004. May I know more details about the libraries or models you used? Coqui? Honestly, sorry to mention it, but the result doesn't seem as smooth as the other languages published on MeloTTS. We are looking forward to the day when the smoothness of Thai TTS is comparable to the major languages. However, I feel MeloTTS might have a chance. |
Since I had never heard of MeloTTS before, I chose to fine-tune with Coqui. I will try fine-tuning with MeloTTS, but the training takes a long time, and I am using my son's PC for training because he has a GPU with high VRAM. Since it's the school holidays in Thailand, I should be able to experiment more. As I haven't worked in data before, I might not be able to answer many technical questions about the model. I used to be a developer several years ago. |
Hello everyone. I've been a bit busy lately, but I've pushed some changes that I think should fix a lot of the outstanding issues. However, I am running into some CUDA errors when trying to get the training code to run, which weren't happening before.
I think this means the code is trying to access elements of a tensor outside its valid range, but I am not sure where exactly this is happening. For added context, this is what my train.list file looks like right now.
|
@jeremy110 should I update the number of tones in symbols.py to 5 as Thai has 5 tones? |
Hello @jadechip There are some issues in your train.list:

```
/path/audio.wav|F1|TAI|一杯走味的咖啡,|_ ʦ i t p u e ʦ a u b i e k a p i , _|0 8 8 8 1 1 1 2 2 2 7 7 5 1 1 1 1 0 0|1 3 3 3 2 1 2 2 1 1
```
|
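For anyone checking their own data, a hypothetical sanity check over the pipe-separated layout visible in the example line above; the field names are inferred, and the actual loader may name or order them differently:

```python
def parse_train_line(line: str):
    # Inferred layout: wav_path|speaker|language|text|phones|tones|word2ph
    wav_path, speaker, lang, text, phones, tones, word2ph = line.strip().split("|")
    phones = phones.split()
    tones = [int(t) for t in tones.split()]
    word2ph = [int(w) for w in word2ph.split()]
    # These invariants catch the out-of-range indexing errors discussed above:
    assert len(tones) == len(phones), "one tone per phone"
    assert sum(word2ph) == len(phones), "word2ph must sum to the phone count"
    return wav_path, speaker, lang, text, phones, tones, word2ph
```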
Thank you @jeremy110, can I just clarify:
Is this correct? |
Hello @jadechip

```
French: Ce service gratuit est disponible en chinois simplifié et autres 123.
ipa: sə- sɛʁvˈis ɡʁatyˈi ɛt disponˈibl ɑ̃n ʃinwˈa sɛ̃plifjˈe e otʁz sˈɑ̃ vˈɛ̃ tʁwˈa.
phones: ['_', 's', 'ə', '-', 's', 'ɛ', 'ʁ', 'v', 'ˈ', 'i', 's', 'ɡ', 'ʁ', 'a', 't', 'y', 'ˈ', 'i', 'ɛ', 't', 'd', 'i', 's', 'p', 'o', 'n', 'ˈ', 'i', 'b', 'l', 'ɑ', '̃', 'n', 'ʃ', 'i', 'n', 'w', 'ˈ', 'a', 's', 'ɛ', '̃', 'p', 'l', 'i', 'f', 'j', 'ˈ', 'e', 'e', 'o', 't', 'ʁ', 'z', 's', 'ˈ', 'ɑ', '̃', 'v', 'ˈ', 'ɛ', '̃', 't', 'ʁ', 'w', 'ˈ', 'a', '.', '_']
tones: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
word2ph: [1, 3, 7, 7, 2, 10, 3, 6, 5, 5, 1, 4, 13, 1, 1]
```

In your case it should be:

```
Final phs: ['_', 'kʰ', 'r', 'aj', 'p', 'e', 'n', 'pʰ', 'uː', 'r', 'a', 'p̚', '_']
Final tones: [0, 2, 2, 2, 2, 2, 2, 5, 5, 3, 3, 3, 0]
Final word2ph: [1, 3, 3, 2, 3, 1]
```
|
@jadechip Hello there, how were your training results? Are you still struggling with the pronunciation? |
Hi @maryne-ii, I believe the pronunciation issues should be resolved; however, I am having some issues getting distributed training to work. This is the error I am getting:
My understanding is the gloo library is used by PyTorch for collective communication in distributed training, so perhaps the error indicates some kind of mismatch between expected and actual sizes during TCP communication, but I am not sure what in my code - if anything - is causing this... I should also note I have been using PyTorch > 2.x to train as I was getting other CUDA errors similar to #96. Training on a single GPU seems to work though. |
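Not a confirmed fix, but one thing worth ruling out: PyTorch's NCCL backend is the usual choice for multi-GPU CUDA training, while Gloo mainly targets CPU tensors. If the training script hard-codes gloo, switching the backend looks like this:

```python
import torch.distributed as dist

# Assumes the usual torchrun / env:// setup; NCCL expects one CUDA device per rank.
dist.init_process_group(backend="nccl", init_method="env://")
```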
I've had some time to continue working on this and was able to resolve the training issues.
|
I think it can be removed because it duplicates the original underscores. |
Ok, thank you, I will give this a shot. |
hello @jadechip |
Thanks for your continued interest and support. Here are some examples from val.list to illustrate what the processed data now looks like:
As you can see, there are still some additional underscore characters ("_") present, which are part of the tokenizer output and should represent the pauses between words, since Thai does not use spaces. I tried several times to remove them, but doing so caused mismatches in the get_bert_feature function, so I decided to keep them for now. Unfortunately, the training loss is still far from ideal and the model is not producing great results. I was able to get it down to around 55 by tweaking the learning rate, but I am not sure what is preventing it from going lower. |
I have pushed all my changes so if anyone would like to give training a go, please let me know how it goes 🤞 |
Here are the results for the different loss components of my latest training run, side by side with the results @jeremy110 posted. The ones on the left in red are mine. Based on my limited understanding, the g/fm loss is increasing, which seems to suggest the generator is struggling to match the discriminator's features. |
I will try another run with a different learning rate decay and more aggressive gradient clipping. I might also need to normalize the audio. |
Hey, can you send me your info so that I can reach out to you regarding the implementation of this project for a new language? I am facing a lot of issues. |
I have created the following PR to add support for the Thai language.
I am in the process of creating a dataset to train the model but would love a PR review of the code first to make sure I am on the right track.
Thank you!
#117