Use WORLD vocoder #9
Since I cannot find a handy WORLD interface for feature extraction, there has been no progress on this. If anyone finds something, please let me know. Plain WaveNet is really slow, especially at test time. The new Parallel WaveNet paper seems interesting, but training the proposed teacher-student framework is also a lot of work. Before starting this phase, I expect to finish some other experiments.
I've used this Python wrapper once: https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder
Alternatively, there is this helper script in Merlin that might be useful: https://github.com/CSTR-Edinburgh/merlin/blob/master/misc/scripts/vocoder/world/extract_features_for_merlin.py
@m-toman thanks for the pointers. How was your experience with the first one? I've tried it on a single audio file, but synthesizing from the extracted features created aberrations in the generated audio. Do you also have any results with WORLD? Would you say things are better with it?
@erogol Like in the Merlin codebase, I also replaced F0 extraction with REAPER, which was a huge improvement, at least over the older method in WORLD (Dio). I've only briefly tried the internal spectrum compression method but did not have any luck with it and went back to using MGCs/MFCCs. It's fast enough for live synthesis on mobile (especially via the streaming implementation). It generally sounds pretty good, but I suspect we can get better results with neural vocoders in the future (although they typically use the acoustic features for conditioning as well). The fast WaveNet method seems a bit bloated to me, though.
@m-toman thanks for sharing your bits & pieces. I think I will try WORLD soon and post the results here.
WaveNet with these values is impractical to use: r9y9/wavenet_vocoder#28 (comment)
Hi @erogol, I'd like to help you integrate the WORLD vocoder. It seems we need a way to extract the params needed for the vocoder. One option for extraction and synthesis could be:

```python
import pyworld as pw
import soundfile as sf

path = "/path/to/input.wav"
# numpy_array, sample_rate
x, fs = sf.read(path)
# f0, spectrogram (spectral envelope), aperiodicities
f0, sp, ap = pw.wav2world(x, fs)
# numpy_array with the resynthesized waveform
y = pw.synthesize(f0, sp, ap, fs, pw.default_frame_period)
sf.write("/path/to/audio.wav", y, fs)
```

Since the linear spectrogram is already being predicted by the post net, it seems we need to also predict the f0 and aperiodicity features.
@stevemurr thanks for the post. I have also just started using pyworld and testing its performance in different settings. I can share the notebook of these experiments if you are interested. I extracted the features for WORLD and am now writing a dataloader. After I finish this, I can push the branch for you to look at. I think what you suggest is possible, but I am not sure about the quality. The only way to check is to try :). It would be worthwhile, since WORLD is much faster than Griffin-Lim with, in general, no sacrifice in quality. The only downside is having to extract the features up front for any dataset, since it is too slow to be done on the fly.
@erogol Intriguing! I'd love to check out the notebook with your comparisons. I'm very curious about the viability of WORLD as an intermediate choice of vocoder until fast neural vocoders become ubiquitous. I've achieved results comparable to Google's paper using https://github.com/Rayhane-mamah/Tacotron-2 and r9y9's WaveNet implementation. The downside is that I had to train the WaveNet to roughly 1.6 million steps, which took a couple of weeks on a single 1080. The upside is that it's useful for offline synthesis.
@stevemurr FYI: https://gist.github.com/erogol/92cdeca0e12c9ea3e79e518111b354c7
What I observe: encoding with f0 tuning by
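(The comment above is cut off in the thread. For readers following along, here is a minimal sketch of what f0 extraction plus tuning typically looks like with pyworld; the harvest/stonemask combination is an assumption, not necessarily what was used here.)

```python
import numpy as np
import pyworld as pw
import soundfile as sf

x, fs = sf.read("audio.wav")
x = np.ascontiguousarray(x, dtype=np.float64)  # pyworld expects contiguous float64

# coarse f0 estimate (harvest is slower but usually more robust than dio)
f0_coarse, t = pw.harvest(x, fs)
# refine ("tune") the f0 trajectory against the waveform
f0 = pw.stonemask(x, f0_coarse, t, fs)

sp = pw.cheaptrick(x, f0, t, fs)  # spectral envelope
ap = pw.d4c(x, f0, t, fs)         # aperiodicity
y = pw.synthesize(f0, sp, ap, fs, pw.default_frame_period)
```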
Data loader added...
@erogol Thanks for the notebook examples, and I agree with your assessment.
@stevemurr I coded train.py with some small, lazy testing. Now I am trying to update the network architecture to accommodate WORLD features. One of the questions is how to replace the intermediate mel-spectrogram prediction. Do you think it makes sense to use a mel-scale spectral envelope at that stage? I am also trying to normalize the WORLD features in a non-disruptive way for efficient training. Any ideas on this?
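(As a concrete starting point for the normalization question, here is a minimal sketch of per-dimension mean/variance normalization in plain numpy; this is only an assumed baseline, not the scheme eventually used in the repo.)

```python
import numpy as np

def compute_meanvar(feature_list):
    """feature_list: per-utterance arrays of shape (frames, dims), e.g. stacked WORLD features."""
    stacked = np.concatenate(feature_list, axis=0)
    return stacked.mean(axis=0), stacked.std(axis=0) + 1e-8

def normalize(x, mean, std):
    return (x - mean) / std

def denormalize(x, mean, std):
    return x * std + mean

# usage: compute statistics on the training set once, then apply to every example
# mean, std = compute_meanvar(train_features)
# x_norm = normalize(x, mean, std)
```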
@erogol Sorry for the late reply! I'd like to share a tutorial from one of r9y9's repos - he describes a traditional TTS pipeline that uses linguistic features and acoustic features to build a duration model and an acoustic model - hopefully this can serve as a guide in targeting WORLD. Since we are using the
Following this, I assume we should attempt mean/variance normalization for the WORLD features - a function is provided for this in his
Is there a difference between a spectrogram and a spectral envelope? I have an idea for using multiple encoders and decoders that I wanted to get your thoughts on - hopefully it's not too crazy sounding :). Currently the network predicts a mel spectrogram, which works for discovering correct alignments with the text. What if we leave the current architecture as is but create a series of
@stevemurr unfortunately I am busy with other experiments for now; I will return to this thread after a while.
Hi @stevemurr, I am thinking of another way to integrate Tacotron with the WORLD vocoder. What if I just replace the mel spectrogram with WORLD parameters (using pyworld to extract them) and let the model predict these parameters directly? I am not sure if it can work and I am going to give it a try. Did this idea ever occur to you, and have you succeeded in integrating Tacotron with the WORLD vocoder? Thanks~
@Maxxiey I suspect this is a reasonable approach; after all, that's what systems like Merlin do, and also the somewhat older papers by Heiga Zen (although with the Vocaine vocoder) - just put mcep, f0, bap (potentially a V/UV flag, depending on which F0 extractor you use) into a single vector, usually with delta and delta-delta features (+MLPG afterwards; not sure if that is necessary with Taco). Other options:
@m-toman Thank you very much for your reply. Your opinion encourages me, a newbie in the TTS field, a lot.
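(As a rough illustration of the single-vector feature layout @m-toman describes above, here is a hedged sketch using pyworld and pysptk; the mcep order, the all-pass constant alpha, and the simple first-difference deltas are assumptions, not values taken from this thread.)

```python
import numpy as np
import pyworld as pw
import pysptk
import soundfile as sf

x, fs = sf.read("audio.wav")
x = np.ascontiguousarray(x, dtype=np.float64)
f0, sp, ap = pw.wav2world(x, fs)

mgc = pysptk.sp2mc(sp, 59, 0.58)              # mel-cepstrum from the spectral envelope
bap = pw.code_aperiodicity(ap, fs)            # band aperiodicities
lf0 = np.log(np.maximum(f0, 1e-10))[:, None]  # log-f0 as a (frames, 1) column
vuv = (f0 > 0).astype(np.float64)[:, None]    # voiced/unvoiced flag

def delta(feat):
    # crude first-order difference standing in for proper delta windows
    return np.vstack([np.zeros((1, feat.shape[1])), np.diff(feat, axis=0)])

# one vector per frame: statics + deltas + V/UV flag
features = np.hstack([mgc, delta(mgc), lf0, delta(lf0), vuv, bap, delta(bap)])
```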
@m-toman thanks for visiting TTS. Yes, I was planning to do that. I even wrote a script to extract all the WORLD features but could not find time to go further. The DeepVoice3 paper uses WORLD and reports results very close to the NN-based vocoder. It would be preferable over WaveNet since it'd be easier to train and faster at inference. I believe WORLD would give better results; at least without a network in the loop, WORLD performs better recovery than Griffin-Lim. This branch is outdated, but it might be useful for you to take a look: https://github.com/mozilla/TTS/tree/world_new/scripts
@m-toman How were the quality and run-time with WaveRNN?
@Maxxiey you can also check out this: https://gist.github.com/erogol/92cdeca0e12c9ea3e79e518111b354c7
@erogol I only tried the adapted model by fatchord... While I don't have actual numbers, I would say it was probably about 1 minute for a longer sentence instead of 10 minutes with the WaveNet implementation by r9y9, on a GTX 1080 Ti. The GTA samples produced during training were pretty good (https://www.dropbox.com/sh/2gtunx8d1r92fqb/AADh9CJEtvHnQ7YlwNClk8X5a?dl=0&m=), but I wasn't that happy with the actual end-to-end synthesis results. Perhaps that was the fault of the Tacotron model - I didn't train it very long; perhaps I can produce a couple of samples soon. My main issue is that so much more work is going on around WaveNet. And on the other side of the spectrum: we don't have to train WORLD for every speaker, and it compiles easily and runs fast enough on lots of platforms. That's why I think it would certainly be interesting to try out. By chance, do you know what the main differences between your Tacotron implementation and https://github.com/Rayhane-mamah/Tacotron-2 are? (except, of course, TensorFlow vs PyTorch and the WaveNet integration)
@erogol Thanks, this will come in handy~
@m-toman I've not checked https://github.com/Rayhane-mamah/Tacotron-2, but I don't use the Tacotron2 model. The only similarity is the use of Location Sensitive Attention; the rest is the same old Tacotron. I found no improvement from using the other alternative layers proposed in TC2. They probably worked for them because they end the system with WaveNet instead of GL.
@tsungruihon it looks erroneous to me. Are you sure your inputs and outputs are set correctly in every part of your network?
@erogol @begeekmyfriend here is the latest alignment graph. Yesterday I tried to generate audio using
@tsungruihon it looks better. How many epochs?
@erogol @begeekmyfriend nearly
Below is how I extract
@erogol @begeekmyfriend I just plot the
@tsungruihon Maybe you need some post-processing before synthesis. See my code. By the way, you can do a resynthesis with the WORLD vocoder alone and record the feature values to check them. https://github.com/Rayhane-mamah/Tacotron-2/files/2713952/world_vocoder_resynth_scripts.zip Here is the resynth script.
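(A minimal sketch of that kind of sanity check - not the script from the zip above: resynthesize straight from extracted features, with the common post-processing of casting to contiguous float64 and zeroing out negative f0, which pyworld's synthesize expects; the f0 threshold value is an assumption.)

```python
import numpy as np
import pyworld as pw
import soundfile as sf

def world_resynth(f0, sp, ap, fs, out_path="resynth.wav"):
    # pyworld expects C-contiguous float64 arrays
    f0 = np.ascontiguousarray(f0, dtype=np.float64).flatten()
    sp = np.ascontiguousarray(sp, dtype=np.float64)
    ap = np.ascontiguousarray(ap, dtype=np.float64)
    # negative or tiny f0 values (e.g. from prediction error) are treated as unvoiced
    f0[f0 < 1.0] = 0.0
    y = pw.synthesize(f0, sp, ap, fs, pw.default_frame_period)
    sf.write(out_path, y, fs)
    return y

# copy-synthesis check: extract features from a recording and resynthesize them
x, fs = sf.read("ground_truth.wav")
f0, sp, ap = pw.wav2world(np.ascontiguousarray(x, dtype=np.float64), fs)
world_resynth(f0, sp, ap, fs)
```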
@begeekmyfriend Sure, my friend. I followed your code like below
But when I ran
@tsungruihon There is a core dump in
@erogol @begeekmyfriend I would like to share the prediction of the model,
Thanks my friend. Really grateful and appreciated! @begeekmyfriend
WORLD feature extraction from GanTTS helps convergence. Any feedback is welcome! begeekmyfriend/Tacotron-2@e40a7b7
Here is a Biaobei Mandarin demo from T2 + WORLD. The f0 feature value prediction is tough for this model.
@begeekmyfriend
The feature extraction part is derived from the gantts project.
I no longer recommend this vocoder. The feature values are too sensitive to be predicted. In my humble opinion, a mel spectrogram plus a neural vocoder such as WaveRNN is the most suitable solution for TTS so far. I give up. By the way, the implementation of WORLD + Tacotron2 is still kept in my fork branch.
@begeekmyfriend did you ever try using LPCNet + Tacotron2?
@tsungruihon I saw your comment on how to connect LPCNet and Tacotron; this guide may be useful for you: MlWoo/LPCNet@324b212
@carlfm01 thanks a lot!
I have tried WaveRNN.
A test of the robustness of WORLD vocoder features against smoothing (as I understand it, L1 loss will introduce smoothness into the predicted time series).
Based on: https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder/blob/master/demo/demo.py
As source, this file was used: https://google.github.io/tacotron/publications/tacotron2/demos/romance_gt.wav
Results: reconstruct_smooth_features_example.zip
Here I just use a box filter:
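(The original code snippet did not survive in the thread; below is a minimal sketch of what such a box-filter smoothing test could look like, assuming time-axis smoothing with scipy and resynthesis with pyworld - the kernel sizes are placeholders.)

```python
import numpy as np
import pyworld as pw
import soundfile as sf
from scipy.ndimage import uniform_filter1d

x, fs = sf.read("romance_gt.wav")
f0, sp, ap = pw.wav2world(np.ascontiguousarray(x, dtype=np.float64), fs)

for kernel_size in (3, 5, 9):
    # box filter along the time axis, mimicking over-smoothed predictions
    f0_s = uniform_filter1d(f0, size=kernel_size)
    sp_s = uniform_filter1d(sp, size=kernel_size, axis=0)
    ap_s = uniform_filter1d(ap, size=kernel_size, axis=0)
    y = pw.synthesize(f0_s, sp_s, ap_s, fs, pw.default_frame_period)
    sf.write("reconstruct_smooth_k{}.wav".format(kernel_size), y, fs)
```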
About the same test for the Griffin-Lim algorithm: even at kernel_size 5 it already produces bad results.
Results: griffin_lim_smooth_mel_reconstruction.zip
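(Again, the original snippet is missing; a minimal sketch of the analogous test with librosa is below - the STFT/mel parameters are assumptions, not the values actually used.)

```python
import librosa
import soundfile as sf
from scipy.ndimage import uniform_filter1d

y, sr = librosa.load("romance_gt.wav", sr=None)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

for kernel_size in (3, 5):
    # smooth each mel band along the time axis
    mel_s = uniform_filter1d(mel, size=kernel_size, axis=1)
    # invert via Griffin-Lim (mel_to_audio runs Griffin-Lim internally)
    y_rec = librosa.feature.inverse.mel_to_audio(
        mel_s, sr=sr, n_fft=1024, hop_length=256, n_iter=60)
    sf.write("griffin_lim_smooth_k{}.wav".format(kernel_size), y_rec, sr)
```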