
Try Monotonic Attention #24

Closed
lifeiteng opened this issue Aug 18, 2017 · 16 comments

Comments

@lifeiteng

I tried monotonic attention and got a better result (the alignment is already clear at step 10k).
[alignment plot: eval-125000-0]

  1. Update to TensorFlow 1.3.
  2. In models/tacotron.py, change BahdanauAttention(256, encoder_outputs) to BahdanauMonotonicAttention(256, encoder_outputs).

eval-125k.zip
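For readers without the TensorFlow source at hand, the quantity that `BahdanauMonotonicAttention` computes per decoder step is the expected soft monotonic alignment from Raffel et al. (2017). A minimal NumPy sketch of that recurrence (function name and shapes are illustrative, not this repo's or TensorFlow's API):

```python
import numpy as np

def monotonic_alignment(p_choose, alpha_prev):
    """Expected soft monotonic alignment for one decoder step.

    p_choose:   (T,) sigmoid "attend here" probabilities per encoder step
    alpha_prev: (T,) previous decoder step's attention weights
    Returns alpha: (T,) new attention weights, where
      alpha[j] = p[j] * sum_{k<=j} alpha_prev[k] * prod_{l=k}^{j-1} (1 - p[l])
    i.e. mass can only stay put or move forward, never backward.
    """
    T = len(p_choose)
    alpha = np.zeros(T)
    for j in range(T):
        total = 0.0
        for k in range(j + 1):
            # Probability of having been at k and skipping k..j-1.
            total += alpha_prev[k] * np.prod(1.0 - p_choose[k:j])
        alpha[j] = p_choose[j] * total
    return alpha
```

With `p_choose` saturated at 1 the alignment stays exactly where it was, and with intermediate probabilities the mass spreads strictly forward, which is why the alignment plot becomes a clean monotone band so early.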

@DarkDefender

DarkDefender commented Aug 20, 2017

Nice! I wonder if it will sound much better than the current 877k result, i.e. will the end quality merely converge with the previous method, or will it be able to surpass it?

The results at 125k are already really nice. @lifeiteng do you plan to train it further?

@ElevenGameStudios

@lifeiteng I agree that the convergence time and attention alignment are really impressive.
But I am not quite sure about the voice: it sounds pretty clear, but also weirdly stretched out at this point. I wonder whether it will converge to a more natural-sounding rhythm over training time, or whether this kind of rhythmic "normalization" is part of the monotonic attention itself?
Either way, it would be really interesting to see how this converges after more training!

@lifeiteng
Author

lifeiteng commented Aug 21, 2017

@DarkDefender Yes, I will continue training.
I'm on a business trip (Interspeech 2017), then a short vacation. I will post more analysis and results after September 4th.

@lifeiteng
Author

Yuxuan, the first author of Tacotron, said that they also use monotonic attention in their newest version. They showed paragraph synthesis (more than 400 characters).

@candlewill
Contributor

@lifeiteng Do you have the slides of Tacotron on Interspeech?

@lifeiteng
Author

@candlewill It was a poster, not an oral presentation, so there are no slides.

@DarkDefender

@lifeiteng Did you think that the paragraph version sounded more natural/better? If I took a wild guess, I would suspect that it would introduce a better flow between sentences in a paragraph. But not much else.

@lifeiteng
Author

@DarkDefender The audio quality is great, but I did not make a careful comparison; I believe all of the examples are from the newest version.
[photo: wechatimg7]

@DarkDefender

DarkDefender commented Sep 13, 2017

@lifeiteng Did they post the updated samples anywhere public? I haven't been able to find any official updates.

Besides that, have you managed to train the Monotonic Attention model any further? As we discussed previously it would be interesting to see if the quality improves dramatically over the current non-monotonic speech samples.

@lifeiteng
Author

  1. It seems they have not been published.
  2. Actually, I failed to reproduce the monotonic-attention result on the newest version of this repo.

But I got a pretty good result (alignment is clear at step 6k, on both single- and multi-speaker datasets) with my own implementation (different audio feature processing and other tunings) using monotonic attention. The multi-speaker version is the same as Deep Voice 2.
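For context on the last point: Deep Voice 2 gets its multi-speaker support from a learned per-speaker embedding that conditions a shared network. A minimal sketch of that idea (class name, sizes, and the concatenation point are illustrative assumptions, not the actual implementation):

```python
import numpy as np

class SpeakerConditioning:
    """Learned per-speaker embedding table; each speaker's vector is
    concatenated onto every decoder input frame so a single model can
    produce multiple voices (Deep Voice 2-style conditioning)."""

    def __init__(self, n_speakers, embed_dim, seed=0):
        rng = np.random.default_rng(seed)
        # In a real model this table is trained; here it is just initialized.
        self.table = 0.1 * rng.standard_normal((n_speakers, embed_dim))

    def condition(self, decoder_input, speaker_id):
        # decoder_input: (T, D)  ->  (T, D + embed_dim)
        emb = np.tile(self.table[speaker_id], (decoder_input.shape[0], 1))
        return np.concatenate([decoder_input, emb], axis=1)
```

Deep Voice 2 actually injects the speaker vector at several sites (RNN initial states, attention, and so on), but the single concatenation above captures the core mechanism.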

@pineking

@lifeiteng Could the different feature processing account for the significant improvement?

@lifeiteng
Author

@pineking I tested the PCM value ranges [-1, 1] and [-32768, 32767]; [-1, 1] failed to converge (which is strange, because FFT(x * 32768) = 32768 * FFT(x)). The first author of Tacotron told me they use [-1, 1]. I didn't spend more time on this; it was just one pair of experiments.
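The linearity claim in the parenthetical is easy to check numerically, which suggests the difference has to come from a later nonlinear step (e.g. log compression or normalization of the spectrogram features), not from the FFT itself. A quick sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1024)   # waveform scaled to [-1, 1]

# The DFT is linear, so int16-style scaling only scales the spectrum:
assert np.allclose(np.fft.rfft(32768.0 * x), 32768.0 * np.fft.rfft(x))

# But a log-magnitude feature shifts by a constant instead of scaling,
# which changes the statistics the rest of the model sees:
shift = (np.log(np.abs(np.fft.rfft(32768.0 * x)))
         - np.log(np.abs(np.fft.rfft(x))))
print(np.allclose(shift, np.log(32768.0)))  # True
```

So if the feature pipeline normalizes log spectrograms with fixed reference or clipping values tuned for one input scale, the other scale can land in a poorly conditioned range, which would fit the observed convergence difference.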

@saxenauts

@lifeiteng Hey, could you share results if you trained it further, and whether it converged (after 125k steps, that is)?
Also, what dataset are you using?
And, as @ElevenGameStudios pointed out, the syllables seem stretched out. Is that because of monotonic attention?
Thanks.

@lifeiteng
Author

Results after 125k steps (this repo):

step-550k.zip
step-883k.zip

Multi-speaker results (my implementation):

read.zip

He has read the whole thing.
He reads books.

@lifeiteng
Author

Closing this.

@mutiann

mutiann commented Dec 11, 2018

Hi! Did you use any other tricks in training? When I simply substitute the monotonic attention mechanism for the original one, the model fails to attend to the correct encoder step during inference (without ground truth given), and instead rapidly reaches the last encoder step within a few decoder steps, even though the model works when aligned against ground truth (teacher forcing).
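The failure mode described here is consistent with how monotonic attention behaves at inference time: the attention head can only stay put or move forward, so if the "choose" probabilities are poorly calibrated, the scan can fall straight through to the last encoder step. A minimal sketch of greedy hard monotonic decoding (names are illustrative, following Raffel et al. 2017, who add pre-sigmoid noise during training precisely to push these probabilities toward 0 or 1):

```python
def hard_monotonic_step(p_choose, prev_index):
    """One inference step of greedy hard monotonic attention.

    Scan forward from the previously attended encoder index and attend to
    the first index whose "choose" probability exceeds 0.5.  If no index
    qualifies, the scan falls through to the last encoder step -- the
    runaway behaviour described in the comment above.
    """
    for j in range(prev_index, len(p_choose)):
        if p_choose[j] > 0.5:
            return j
    return len(p_choose) - 1
```

If training never forces the probabilities toward discrete values (for instance, no noise on the pre-sigmoid energies), they can all sit below the threshold at inference, and the head jumps to the end in a single step even though teacher-forced training looked fine.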
