Try Monotonic Attention #24
Comments
Nice! I wonder if it will sound much better than the current 877k result. The results at 125k are already really nice. @lifeiteng do you plan to train it further?
@lifeiteng I agree that the convergence time and attention alignment are really impressive.
@DarkDefender Yes, I'll continue training.
|
@lifeiteng Do you have the Tacotron slides from Interspeech?
@candlewill It was a poster, not an oral presentation.
@lifeiteng Did you think the paragraph version sounded more natural/better? If I took a wild guess, I would suspect that it introduces better flow between sentences in a paragraph, but not much else.
@DarkDefender The audio quality is great, but I didn't make a careful comparison; I think all the examples are from the newest version.
@lifeiteng Did they post the updated samples anywhere public? I haven't been able to find any official updates. Besides that, have you managed to train the Monotonic Attention model any further? As we discussed previously it would be interesting to see if the quality improves dramatically over the current non-monotonic speech samples. |
But I got a pretty cool result (the alignment is clear at step 6k, on both single- and multi-speaker datasets) with my own implementation (different audio feature processing and other tunings) using monotonic attention. The multi-speaker version is the same as Deep Voice 2.
@lifeiteng Could the different feature processing account for the significant improvement?
@pineking I tested the PCM value ranges [-1, 1] and [-32767, 32768]; [-1, 1] failed to converge, which is strange, because FFT(x*32768) = 32768 * FFT(x). The first author of Tacotron told me they use [-1, 1]. I didn't spend more time on this; it was just one pair of experiments.
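The linearity claim is easy to check numerically. A minimal NumPy sketch (the random signal and its length are illustrative, not from the experiments above):

```python
import numpy as np

# Illustrative waveform in [-1, 1]; any real signal would do.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1024)

# FFT is linear, so scaling the input by 32768 scales the spectrum by 32768.
scaled_input = np.fft.rfft(x * 32768)
scaled_output = 32768 * np.fft.rfft(x)

print(np.allclose(scaled_input, scaled_output))
```

So the spectra differ only by a constant factor; any convergence difference between the two ranges has to come from elsewhere in the pipeline (e.g. normalization or loss scaling), not from the FFT itself.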
@lifeiteng Hey, could you share the results if you trained it further, and whether it converged (after 125k steps, that is)?
results after 125k (this repo)
multi-speaker results (my implementation)
|
Closing this.
Hi! Have you used any other tricks in training? When I simply substitute the attention mechanism with the monotonic one, the model fails to attend to the correct encoder step during inference (without the ground truth given) and rapidly jumps to the last encoder step within a few decoder steps, although it works when the ground-truth alignment is given.
I tried monotonic attention and got a better result (the alignment is clear at step 10k).
BahdanauAttention(256, encoder_outputs) -> BahdanauMonotonicAttention(256, encoder_outputs)
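The one-line swap above changes how the attention weights are computed at each decoder step. As a rough illustration of what the monotonic variant does, here is a minimal NumPy sketch of the soft monotonic alignment recurrence from Raffel et al. (2017); this is a simplified re-derivation for one decoder step, not the actual TensorFlow `BahdanauMonotonicAttention` code:

```python
import numpy as np

def monotonic_attention(p_choose, prev_align):
    """Soft monotonic alignment for one decoder step.

    p_choose:   (T,) "select this encoder step" probabilities,
                i.e. sigmoid of the attention energies.
    prev_align: (T,) attention weights from the previous decoder step.
    Returns the new (T,) attention weights; mass can only move forward.
    """
    T = len(p_choose)
    align = np.zeros(T)
    q = 0.0  # probability of having reached encoder step j without stopping
    for j in range(T):
        # Carry over mass not selected at j-1, plus mass arriving from the
        # previous decoder step at j.
        not_chosen_prev = 1.0 - (p_choose[j - 1] if j > 0 else 0.0)
        q = not_chosen_prev * q + prev_align[j]
        align[j] = p_choose[j] * q
    return align

# If every step is selected with probability 1, the alignment stays put:
prev = np.array([1.0, 0.0, 0.0, 0.0])
print(monotonic_attention(np.ones(4), prev))       # equals prev
print(monotonic_attention(np.full(4, 0.5), prev))  # mass spreads forward only
```

Because the recurrence only ever moves probability mass left-to-right, the decoder cannot revisit earlier encoder steps, which is presumably why the alignment becomes clean so much earlier in training.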
eval-125k.zip