
English synthesis is good, how about Chinese? #58

Closed
lucasjinreal opened this issue Oct 26, 2018 · 50 comments

Comments

@lucasjinreal

Is there any blog post or attempt at doing TTS for Chinese with this?

@erogol
Contributor

erogol commented Oct 26, 2018

Never tried it, sorry, but it'd be interesting to see.

@dvbfuns

dvbfuns commented Nov 21, 2018

Chinese also works well with this model. Compared with other Tacotron implementations, this model produces a clear voice in less time: in my test, with the same dataset, 10,000 steps synthesized voice of a quality similar to another Tacotron at 50,000 steps.

@erogol
Contributor

erogol commented Nov 21, 2018

@dvbfuns great to hear that. Do you have any samples to share? It'd be great to put them on the main page, if you don't mind.

@lucasjinreal
Author

@dvbfuns Which training dataset are you using? A Chinese TTS version would be a good addition to this great repo.

@dvbfuns

dvbfuns commented Nov 22, 2018

@erogol I'd like to share the samples, but I have trouble accessing soundcloud.com. Any suggestions on how to share them? Or I could send them to you by e-mail?

@erogol
Contributor

erogol commented Nov 22, 2018

@dvbfuns e-mail would work: egolge@mozilla.com. Thanks for your help.

@erogol
Contributor

erogol commented Nov 22, 2018

@dvbfuns you might even consider opening a PR with your Chinese changes. I agree, that would be a great addition.

@dvbfuns

dvbfuns commented Nov 23, 2018

@erogol, I already sent you an e-mail with the model and samples, please take a look.

@lucasjinreal
Author

@erogol Would you like to add it to the README or a model zoo? @dvbfuns BTW, did you use your own labeled dataset?

@erogol
Contributor

erogol commented Nov 23, 2018

@jinfagang I can post whatever @dvbfuns can provide, but I also understand if he'd rather not share the model.

@lucasjinreal
Author

@erogol Could you forward the voice samples to me? I'd like to check the quality of the Chinese results. jinfagang19@gmail.com, thanks in advance.

@erogol
Contributor

erogol commented Nov 24, 2018

@jinfagang anything I have will be posted on GitHub as soon as I receive it.

@erogol
Contributor

erogol commented Dec 14, 2018

I'm closing this due to inactivity. Feel free to reopen.

@erogol erogol closed this as completed Dec 14, 2018
@mazzzystar

mazzzystar commented Jan 17, 2019

@jinfagang @erogol
Hi! I'd like to share some Chinese results. You can download demo.zip

Still, "Decoder stopped with 'max_decoder_steps'" sometimes happens when inferring long sentences (>20 characters). I'd be glad to hear if you know a good way to handle it.

@erogol
Contributor

erogol commented Jan 17, 2019

@mazzzystar Thanks for sharing your results. They sound quite okay to me, but I am not a Chinese speaker.

I'd suggest replacing the stop-token layer with an RNN, as it was in previous versions. The RNN-based module is larger, but it is more reliable. Here is a snapshot:

from torch import nn

class StopNet(nn.Module):
    r"""
    Predicting stop-token in decoder.
    
    Args:
        r (int): number of output frames of the network.
        memory_dim (int): feature dimension for each output frame.
    """
    
    def __init__(self, r, memory_dim):
        super(StopNet, self).__init__()
        self.rnn = nn.GRUCell(memory_dim * r, memory_dim * r)
        self.relu = nn.ReLU()
        self.linear = nn.Linear(r * memory_dim, 1)
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, inputs, rnn_hidden):
        """
        Args:
            inputs: network output tensor with r x memory_dim feature dimension.
            rnn_hidden: hidden state of the RNN cell.
        """
        rnn_hidden = self.rnn(inputs, rnn_hidden)
        outputs = self.relu(rnn_hidden)
        outputs = self.linear(outputs)
        outputs = self.sigmoid(outputs)
        return outputs, rnn_hidden
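
A minimal usage sketch of the StopNet above (illustrative shapes and values, not the repo's exact decoder code):

import torch

r, memory_dim, batch = 5, 80, 2
stopnet = StopNet(r, memory_dim)
frames = torch.randn(batch, r * memory_dim)   # r decoder output frames, flattened
hidden = torch.zeros(batch, r * memory_dim)   # previous GRU hidden state (zeros at the first step)
stop_prob, hidden = stopnet(frames, hidden)   # stop_prob: (batch, 1), values in [0, 1]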

@mazzzystar

mazzzystar commented Jan 17, 2019

@erogol
Thanks for your reply, I will try it out.
Actually, in Chinese it is really important to know where to pause in a sentence and how long each pause should be; pauses normally occur several times per sentence, and if all of them are right the result is considered to have "good naturalness". As far as I know, my model based on mozilla-TTS outperforms most current Mandarin Chinese TTS systems in naturalness, thanks for your work!

One part I think needs improvement is the voice texture: it is still a little bit "electronic" and unlike a real human, though it's good enough. I may start to focus on this part and try some methods, such as a different vocoder or another attention mechanism. BTW, have you considered using a Transformer to replace the current RNN part? I noticed that more and more people prefer Transformers over RNNs since BERT came out.

Finally, thanks again for your great work!

@erogol
Contributor

erogol commented Jan 17, 2019

@mazzzystar
Thanks for the kind words :). Yeah, I'd guess things would be much better if we could combine TTS with a neural vocoder. It is in progress, but we need some time to solve some internal technicalities before we continue. You could also try the WORLD vocoder; there is a discussion about it in the issues, with some example scripts to help you. It shouldn't be too hard.

I'd say attention is more about getting the right pronunciation, while naturalness is a matter of the vocoder. You can also try the attention windowing implemented in the dev branch in layers/attention.py; it gives better monotonic attention with less noise, and based on the window size you can roughly control the pace of the speech. You can also try multiplying the attention weights by ~4 before applying normalization, which also leads to a cleaner alignment.
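
A minimal sketch of those two tricks, written from the description above rather than copied from layers/attention.py (the window bounds and scaling factor are illustrative):

import torch

def windowed_sharpened_attention(scores, prev_max_idx, win_back=3, win_front=6, sharpen=4.0):
    # scores: (batch, encoder_steps) raw alignment energies for one decoder step.
    # prev_max_idx: (batch,) index of the encoder step attended at the previous decoder step.
    batch, n = scores.shape
    idx = torch.arange(n, device=scores.device).unsqueeze(0)   # (1, n)
    lo = (prev_max_idx - win_back).unsqueeze(1)                # (batch, 1)
    hi = (prev_max_idx + win_front).unsqueeze(1)
    # (1) windowing: forbid attention outside a small window around the previous position
    scores = scores.masked_fill((idx < lo) | (idx > hi), float("-inf"))
    # (2) sharpening: scale the scores (~4x) before normalization for a more peaked alignment
    return torch.softmax(sharpen * scores, dim=-1)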

When it comes to BERT, I've not tried it yet. One problem with BERT is that it requires more memory than an RNN, so it might be tricky to train on low-budget systems, which I prefer to avoid. However, if you'd like to try, I am here to help.

Thanks again!

@lucasjinreal
Author

@mazzzystar Your Chinese results are really impressive! May I ask which Chinese voice corpus you used, or how you organized your data?

@mazzzystar

mazzzystar commented Jan 18, 2019

@jinfagang
Sorry, I can't share the details since it's part of my current work and may hurt my company's interests. I hope you can understand. I'm here just to let you know that mozilla-TTS works well for Chinese synthesis.

@OswaldoBornemann

@mazzzystar hello, the demo.zip link doesn't seem to work. How can I download it?

@OswaldoBornemann

@mazzzystar @jinfagang @dvbfuns @erogol yes, I also tried it on a Chinese corpus. The model gets better alignment than other Tacotron 2 projects, especially nvidia/tacotron2, but I haven't listened to the synthesized voice quality yet.

@lucasjinreal
Author

@tsungruihon Which repo are you using?

@OswaldoBornemann

@jinfagang I just use mozilla TTS.

@lucasjinreal
Author

@tsungruihon Sorry, I meant: which corpus?

@OswaldoBornemann

@jinfagang audio posted in some app.

@puppyapple

puppyapple commented Nov 27, 2019

Hello @erogol, thanks for your great work! I'm new to the TTS domain and trying to adapt your repo to a Chinese dataset (10,000 sentences, 12 h). Training is still ongoing but seems promising. I have several questions about the details; I hope you can give me some advice:

  • I noticed that in character training mode (use_phonemes=false) there is no 'enable_eos_bos' option to append an end token to sentences, something I saw discussed a lot elsewhere, e.g. in Nvidia/Tacotron2; instead the model just learns to stop through the stopnet. In this case, should I simply wait until the stop loss converges to zero? For now my alignment always has gaps after the stop point, as shown below (along with the "Decoder stopped with 'max_decoder_steps'" warning), so I assume the model has not learned when to stop. Why not add a stop token here to help?
    [alignment image]
  • Regarding training time: I saw your shared pretrained LJSpeech models on Google Drive, trained for 160k steps with batch size 16. My question is, should we watch the eval loss to decide when to stop training, or just let training continue as long as the training loss improves (overfitting?)
  • With the NVIDIA/Tacotron2 repo I had a problem restoring training (a loss spike after the first step and the model starting from scratch), which I found is probably related to the Adam optimizer. Have you ever encountered such an issue?
    Thanks!

@erogol
Contributor

erogol commented Dec 2, 2019

  • It should learn to stop after enough training, and that is more reliable than using an EOS token. Otherwise, you can also try EOS; a minimal sketch of the idea follows below.

  • Eval or train loss does not exactly reflect the final performance. The best approach is to listen yourself and pick the best-sounding model.

  • In my implementation, fine-tuning should work flawlessly.
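
For reference, a minimal sketch of the EOS/BOS idea mentioned in the first point (the boundary symbols here are hypothetical placeholders, not necessarily the ones this repo defines):

BOS, EOS = "^", "~"  # hypothetical sentence-boundary symbols added to the character set

def add_bos_eos(text):
    # Wrap the input so the model sees explicit sentence boundaries
    # instead of relying only on the stopnet to learn when to stop.
    return BOS + text + EOS

print(add_bos_eos("ni3 hao3"))  # -> ^ni3 hao3~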

@puppyapple

puppyapple commented Dec 2, 2019

@erogol thanks for the reply. I'm now training without forward attention, and the problem in the figure above seems to have disappeared for now; I will wait longer to see how it develops. As for fine-tuning, unfortunately I don't even get the chance to hit a loss spike, because I cannot launch restore (or continue) training due to the issue I described in #318. Any idea about this? I tried many modifications, but none of them worked.

@puppyapple

puppyapple commented Dec 11, 2019

@erogol Hello, thanks for your great work and for replying to my questions. I finally succeeded in training a Tacotron 2 model on a public Chinese dataset, as well as a WaveRNN on the predicted mels. The results sound good; I'd like to share some audio samples here in a few days.
Following #26, I'm now fine-tuning the Tacotron 2 with the 'BN' prenet, and the improvement in loss is significant, nearly the same as in the figures you shared. Training is still ongoing and I will compare the generated audio afterwards.
Just a small question: after fine-tuning with the 'BN' prenet, is it necessary to retrain (or fine-tune) my WaveRNN model on the new predicted mels? Thanks!
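
For context, a minimal sketch of what a 'BN' prenet looks like compared with the original dropout prenet (layer sizes are illustrative, not necessarily this repo's exact module):

import torch
from torch import nn

class BNPrenet(nn.Module):
    # Linear -> BatchNorm -> ReLU blocks with no dropout: the "BN" variant used for
    # fine-tuning, as opposed to the original Linear -> ReLU -> Dropout prenet.
    def __init__(self, in_dim, hidden_dims=(256, 256)):
        super().__init__()
        layers, prev = [], in_dim
        for dim in hidden_dims:
            layers += [nn.Linear(prev, dim, bias=False), nn.BatchNorm1d(dim), nn.ReLU()]
            prev = dim
        self.net = nn.Sequential(*layers)

    def forward(self, x):                    # x: (batch, time, in_dim)
        b, t, _ = x.shape
        y = self.net(x.reshape(b * t, -1))   # flatten time so BatchNorm1d sees (N, features)
        return y.reshape(b, t, -1)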

@erogol
Contributor

erogol commented Dec 11, 2019

@puppyapple Great to hear that!!

To your question: if you train WaveRNN with the final mel specs, you are likely to get better results. However, even without that it should sound good enough.

@puppyapple

@erogol OK. Then I think I will give it a try anyway! 😁

@puppyapple

Here are two samples from my Tacotron 2 + WaveRNN using the dev branch of this repo, thanks for your work! The alignment is shown in the figure (forward attention is enabled during inference). It seems the 'target' parameter has a significant impact on voice quality: the audio with target=4000 sounds more 'trembling' than the one with target=22000, which is much cleaner.
samples.zip
alignment
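
A rough sketch of why 'target' matters for batched WaveRNN inference (the numbers and the helper below are illustrative, not the vocoder repo's actual code): the utterance is split into segments of roughly 'target' samples that are generated in parallel and crossfaded over 'overlap' samples, so a larger target means fewer segment boundaries and fewer audible seams.

def fold_for_batched_generation(n_samples, target, overlap):
    # Split n_samples into chunks of `target` samples plus `overlap` samples of
    # crossfade with the next chunk; fewer chunks -> fewer audible seams.
    starts = range(0, max(n_samples - overlap, 1), target)
    return [(s, min(s + target + overlap, n_samples)) for s in starts]

five_seconds = 5 * 22050
print(len(fold_for_batched_generation(five_seconds, target=4000, overlap=550)))    # ~28 segments
print(len(fold_for_batched_generation(five_seconds, target=22000, overlap=550)))   # 5 segments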

@lucasjinreal
Author

@puppyapple Amazing, this is the best result I have ever heard on a Chinese dataset. Will you share a branch for this?

@puppyapple

@jinfagang Thanks, nothing special has been added. You could check my fork, which is entirely based on @erogol's work; only a few modifications were made to fit the Chinese data (Biaobei 10000).

@OswaldoBornemann

OswaldoBornemann commented Dec 17, 2019

@puppyapple would you mind sharing your config.json file?

@lucasjinreal
Author

@puppyapple On which branch? How do you prepare for training on Biaobei?

@puppyapple

@jinfagang @tsungruihon Everything is in the dev branch. For the Biaobei dataset I haven't made any extra preparation; I just followed erogol's implementation and got positive results. Still, this public dataset is rather small and its transcripts lack punctuation symbols, so not all synthesized sentences are as natural as the ones in my samples; some also have bad or wrong punctuation handling. In general the results are not bad.

@OswaldoBornemann

OswaldoBornemann commented Dec 18, 2019

@puppyapple thanks, my friend. It seems that you use Tacotron 2 with location-sensitive attention instead of forward attention, according to the config.json in your dev branch.

@puppyapple

@tsungruihon yes, and I also fine-tuned with the BN prenet as erogol described in #26.

@shad94

shad94 commented Dec 18, 2019

@puppyapple, I have two questions, since I am new to the project:

  1. Have you changed the contents of the files in TTS/tests for Chinese? Same question for TTS/mozilla-us-phonemes.
  2. How do you generate the encoder vs. decoder graph?
    Thank you

@puppyapple

@shad94

  1. I didn't use TTS/tests for testing; I used the benchmark Jupyter notebook in TTS/notebooks with some modifications.
  2. It's already implemented by erogol in the logger class.

@OswaldoBornemann

@puppyapple Thanks, my friend.

@WhiteFu

WhiteFu commented Jan 7, 2020

@puppyapple I see that the audio you provided is 48000 Hz. Is your sample_rate in config.json set to 48000? I ask because upsampling (22 kHz -> 48 kHz) wouldn't recover high-frequency details.

@puppyapple

@WhiteFu Yes, since the Biaobei dataset is 48 kHz, I just kept it as it is, without any upsampling.

@WhiteFu

WhiteFu commented Jan 7, 2020

Thank you for your reply. I will look into the details in your fork branch :)

@chynphh

chynphh commented Jan 12, 2020

@erogol @puppyapple Hi, I am a newbie in this area. I'm trying to use the Tacotron 2 model to train a Chinese multi-speaker model. Here are my samples, and I have some questions.

  1. The generated audio files are understandable but very noisy (the samples are in samples/phonemes/120Kstep/). I did not use any vocoder (GL or WaveRNN). Is this normal? How do I deal with this problem: use a vocoder, or is there another idea?
  2. For Chinese, is it better to use pinyin or phonemes? When I use phonemes, some tones are not accurate, as if a non-native speaker were speaking Chinese. My pinyin-based model has not yet converged.
  3. Why is there a big difference between training and testing? I set the same parameters for the synthesis function, yet the results for the test text during training (train.py) are much better than in testing (Benchmark.ipynb). The training-time samples are in samples/without_phonemes(use pinyin)/29037steps and samples/without_phonemes(use pinyin)/30886steps; the testing-time samples are in samples/without_phonemes(use pinyin)/30000steps.
  4. Is there a big difference between training WaveRNN on raw wav files versus on mels from the Tacotron 2 model? Which is better? Is there a guide to training the WaveRNN model?

The format of the file name is {text}-{speaker id}-{train steps}.
Thank you very much! :)

@puppyapple

@chynphh Since I'm also new to the TTS domain, I can only try to answer your questions from my own point of view, which may not be correct.

  1. Using a vocoder will certainly give better audio quality. In this repo, erogol has already implemented GL to generate test audio for TensorBoard display; have you listened to that result? I've tried both WaveRNN and ParallelWaveGAN: WaveRNN can reach high quality but needs a large 'overlap' parameter, which increases inference time, while the ParallelWaveGAN output is slightly (though not obviously) noisy and much faster.
  2. In my own tests, pinyin is sufficient to get good pronunciation (see the pinyin sketch after this list).
  3. 30k steps seems far from enough; you could wait longer.
  4. I have not tried ground-truth mels from raw wav files for WaveRNN; Tacotron 2-generated mels seem to work well. You can study erogol's implementation and give it a try; for me it's clear enough.
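
As a point of reference (an assumption on my side, since the comments above don't say which text frontend was used), tone-numbered pinyin can be produced from raw Chinese text with the pypinyin library:

from pypinyin import lazy_pinyin, Style

text = "你好，世界"
tokens = lazy_pinyin(text, style=Style.TONE3)  # tone numbers appended, e.g. ni3
print(" ".join(tokens))                        # e.g. ni3 hao3 ， shi4 jie4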

@chynphh

chynphh commented Jan 16, 2020

@puppyapple thanks for your reply!

After my experiments, using pinyin is indeed better than phonemes.
I trained Tacotron 2 for 240K steps. The results were good but still a bit noisy.
Now I'm trying to train a WaveRNN model. I tried to use the mels generated by Tacotron 2, but they don't work with the raw wav files; it seems to be caused by a mismatch between the raw wav files and the mels generated by Tacotron 2 (#26 (comment)).
So I trained WaveRNN on raw wav files with ground-truth mels; so far, after 180K steps, it hasn't worked.
When training WaveRNN with mels from Tacotron 2, which wavs do you use as targets: the ground-truth wav files or the wavs generated by Tacotron 2?

@puppyapple

puppyapple commented Jan 16, 2020

@chynphh Mels generated by the trained Tacotron 2 model as input, and ground-truth audio files as target. Have you extracted the mels using the right config? You could refer to the benchmark notebook in this repo to do that; maybe a few modifications are needed. For #26 (comment), try to locate the out-of-range sample to find out the reason (like a 'hop_length' mismatch, etc.); a quick check is sketched below.
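
A quick sanity check for the hop_length point above (a sketch with assumed file names and parameters; they must match your own config):

import librosa
import numpy as np

sample_rate, hop_length = 22050, 256                 # must match the TTS config
wav, _ = librosa.load("sample.wav", sr=sample_rate)  # ground-truth audio
mel = np.load("sample_mel.npy")                      # (n_mels, frames) mel extracted from Tacotron 2
expected_frames = len(wav) // hop_length
print(mel.shape[1], expected_frames)                 # should be close; a big gap points to a mismatch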

@chynphh

chynphh commented Jan 16, 2020

@puppyapple Thanks for your suggestions and answers, I will double check my code.
