
Try Monotonic Attention #24

Closed
lifeiteng opened this issue Aug 18, 2017 · 16 comments

Comments

@lifeiteng

I tried monotonic attention and got a better result (the alignment is already clear at step 10k).
[alignment plot: eval-125000-0]

  1. Update to TensorFlow 1.3.
  2. In models/tacotron.py, change BahdanauAttention(256, encoder_outputs) to BahdanauMonotonicAttention(256, encoder_outputs).

eval-125k.zip
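For readers without the TensorFlow source at hand, the quantity that `BahdanauMonotonicAttention` computes per decoder step is the expected soft monotonic alignment from Raffel et al. (2017). A minimal NumPy sketch of that recurrence (function name and shapes are illustrative, not this repo's or TensorFlow's API):

```python
import numpy as np

def monotonic_alignment(p_choose, alpha_prev):
    """Expected soft monotonic alignment for one decoder step.

    p_choose:   (T,) sigmoid "attend here" probabilities per encoder step
    alpha_prev: (T,) previous decoder step's attention weights
    Returns alpha: (T,) new attention weights, where
      alpha[j] = p[j] * sum_{k<=j} alpha_prev[k] * prod_{l=k}^{j-1} (1 - p[l])
    i.e. mass can only stay put or move forward, never backward.
    """
    T = len(p_choose)
    alpha = np.zeros(T)
    for j in range(T):
        total = 0.0
        for k in range(j + 1):
            # Probability of having been at k and skipping k..j-1.
            total += alpha_prev[k] * np.prod(1.0 - p_choose[k:j])
        alpha[j] = p_choose[j] * total
    return alpha
```

With `p_choose` saturated at 1 the alignment stays exactly where it was, and with intermediate probabilities the mass spreads strictly forward, which is why the alignment plot becomes a clean monotone band so early.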

@DarkDefender

DarkDefender commented Aug 20, 2017

Nice! I wonder if it will sound much better than the current 877k result, i.e. will the end quality merely converge with the previous method, or will it be able to surpass it?

The results at 125k are already really nice. @lifeiteng do you plan to train it further?

@ElevenGameStudios

@lifeiteng I agree that the convergence time and attention alignment are really impressive.
But I am not quite sure about the voice: it sounds pretty clear, but also weirdly stretched out at this point. I wonder whether it will converge to a more natural-sounding rhythm over training time, or whether this kind of rhythmic "normalization" is part of the monotonic attention itself?
Either way, it would be really interesting to see how this converges after more training!

@lifeiteng
Author

lifeiteng commented Aug 21, 2017

@DarkDefender Yes, I will continue training.
I'm on a business trip (Interspeech 2017), then a short vacation. I will post more analysis and results after September 4th.

@lifeiteng
Author

Yuxuan, the first author of Tacotron, said that they also use monotonic attention in their newest version. They showed paragraph synthesis (more than 400 characters).

@candlewill
Contributor

@lifeiteng Do you have the slides of Tacotron on Interspeech?

@lifeiteng
Author

@candlewill It was a poster, not an oral presentation, so there are no slides.

@DarkDefender

@lifeiteng Did you think that the paragraph version sounded more natural/better? If I took a wild guess, I would suspect that it would introduce a better flow between sentences in a paragraph. But not much else.

@lifeiteng
Author

@DarkDefender The audio quality is great, but I did not make a careful comparison; I believe all of the examples are from the newest version.
[photo: wechatimg7]

@DarkDefender

DarkDefender commented Sep 13, 2017

@lifeiteng Did they post the updated samples anywhere public? I haven't been able to find any official updates.

Besides that, have you managed to train the Monotonic Attention model any further? As we discussed previously it would be interesting to see if the quality improves dramatically over the current non-monotonic speech samples.

@lifeiteng
Author

  1. It seems they have not been published.
  2. Actually, I failed to reproduce the monotonic-attention result on the newest version of this repo.

But I got a pretty good result (alignment is clear at step 6k, on both single- and multi-speaker datasets) with my own implementation (different audio feature processing and other tunings) using monotonic attention. The multi-speaker version is the same as Deep Voice 2.
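For context on the last point: Deep Voice 2 gets its multi-speaker support from a learned per-speaker embedding that conditions a shared network. A minimal sketch of that idea (class name, sizes, and the concatenation point are illustrative assumptions, not the actual implementation):

```python
import numpy as np

class SpeakerConditioning:
    """Learned per-speaker embedding table; each speaker's vector is
    concatenated onto every decoder input frame so a single model can
    produce multiple voices (Deep Voice 2-style conditioning)."""

    def __init__(self, n_speakers, embed_dim, seed=0):
        rng = np.random.default_rng(seed)
        # In a real model this table is trained; here it is just initialized.
        self.table = 0.1 * rng.standard_normal((n_speakers, embed_dim))

    def condition(self, decoder_input, speaker_id):
        # decoder_input: (T, D)  ->  (T, D + embed_dim)
        emb = np.tile(self.table[speaker_id], (decoder_input.shape[0], 1))
        return np.concatenate([decoder_input, emb], axis=1)
```

Deep Voice 2 actually injects the speaker vector at several sites (RNN initial states, attention, and so on), but the single concatenation above captures the core mechanism.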

@pineking

@lifeiteng Could the different feature processing account for the significant improvement?

@lifeiteng
Author

@pineking I tested the PCM value ranges [-1, 1] and [-32768, 32767]; [-1, 1] failed to converge (which is strange, because FFT(x * 32768) = 32768 * FFT(x)). The first author of Tacotron told me they use [-1, 1]. I didn't spend more time on this; it was just one pair of experiments.
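The linearity claim in the parenthetical is easy to check numerically, which suggests the difference has to come from a later nonlinear step (e.g. log compression or normalization of the spectrogram features), not from the FFT itself. A quick sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1024)   # waveform scaled to [-1, 1]

# The DFT is linear, so int16-style scaling only scales the spectrum:
assert np.allclose(np.fft.rfft(32768.0 * x), 32768.0 * np.fft.rfft(x))

# But a log-magnitude feature shifts by a constant instead of scaling,
# which changes the statistics the rest of the model sees:
shift = (np.log(np.abs(np.fft.rfft(32768.0 * x)))
         - np.log(np.abs(np.fft.rfft(x))))
print(np.allclose(shift, np.log(32768.0)))  # True
```

So if the feature pipeline normalizes log spectrograms with fixed reference or clipping values tuned for one input scale, the other scale can land in a poorly conditioned range, which would fit the observed convergence difference.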

@saxenauts

@lifeiteng Hey, could you share results if you trained it further, and whether it converged (after 125k steps, that is)?
Also, what dataset are you using?
And, as @ElevenGameStudios pointed out, the syllables seem stretched out. Is that because of monotonic attention?
Thanks.

@lifeiteng
Author

Results after 125k steps (this repo):

step-550k.zip
step-883k.zip

Multi-speaker results (my implementation):

read.zip

He has read the whole thing.
He reads books.

@lifeiteng
Author

Closing this.

@mutiann

mutiann commented Dec 11, 2018

Hi! Did you use any other tricks in training? When I simply substitute the monotonic attention mechanism for the original one, the model fails to attend to the correct encoder step during inference (without ground truth given), and instead rapidly reaches the last encoder step within a few decoder steps, even though the model works when aligned against ground truth (teacher forcing).
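The failure mode described here is consistent with how monotonic attention behaves at inference time: the attention head can only stay put or move forward, so if the "choose" probabilities are poorly calibrated, the scan can fall straight through to the last encoder step. A minimal sketch of greedy hard monotonic decoding (names are illustrative, following Raffel et al. 2017, who add pre-sigmoid noise during training precisely to push these probabilities toward 0 or 1):

```python
def hard_monotonic_step(p_choose, prev_index):
    """One inference step of greedy hard monotonic attention.

    Scan forward from the previously attended encoder index and attend to
    the first index whose "choose" probability exceeds 0.5.  If no index
    qualifies, the scan falls through to the last encoder step -- the
    runaway behaviour described in the comment above.
    """
    for j in range(prev_index, len(p_choose)):
        if p_choose[j] > 0.5:
            return j
    return len(p_choose) - 1
```

If training never forces the probabilities toward discrete values (for instance, no noise on the pre-sigmoid energies), they can all sit below the threshold at inference, and the head jumps to the end in a single step even though teacher-forced training looked fine.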
