Attention weights with partial flat line (non-English) #137

Closed · SornrasakC opened this issue Oct 13, 2021 · 6 comments

SornrasakC commented Oct 13, 2021

Hi, I have been trying to train this model on a Thai dataset (1 speaker, ~5 hours).

After ~80k steps (batch size = 1, ~31 epochs), the attention weights turn out like this:

[image: attention weight plot with a partially flat region]

Is it normal to see partial flat lines like this? All the issues I looked through only show either an entirely flat line or a clean diagonal...
Or am I being too impatient? It's only 80k steps, after all.

Here's some additional info:

[image: additional training plot]

(Is this even correct?)

[image: gate output plot]

The above results come from warm starting the model from flowtron_ljs.pt with the flow=1 config file (speaker_embedding.weight ignored).
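
For concreteness, the ignore can be done by filtering mismatched keys out of the checkpoint before loading; a minimal sketch (assuming the checkpoint stores weights under a 'state_dict' key, which may differ in flowtron_ljs.pt; the helper name is hypothetical, not Flowtron's actual warm-start code):

```python
import torch

def warm_start(checkpoint_path, model, ignore_keys=("speaker_embedding.weight",)):
    """Load pretrained weights, skipping keys with mismatched shapes."""
    ckpt = torch.load(checkpoint_path, map_location="cpu")
    pretrained = ckpt.get("state_dict", ckpt)  # fall back to a raw state dict
    model_dict = model.state_dict()
    # keep only tensors that exist in the model with a matching shape
    filtered = {
        k: v for k, v in pretrained.items()
        if k in model_dict and k not in ignore_keys
        and v.shape == model_dict[k].shape
    }
    model_dict.update(filtered)
    model.load_state_dict(model_dict)
    return model
```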

Things I have done

  • Trimmed the start and end of each clip with Librosa, and filtered out any clip longer than 10 seconds (see the sketch after this list).
  • Set p_arpabet=0 and, instead of using Thai symbols, converted all my filelists to IPA first, then added those IPA symbols to symbols.py.
  • Changed the cleaners so that they don't transliterate.
  • Added a line to ignore embedding.weight during warm start, since the shapes differ (as in the sketch above).
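
The trim-and-filter sketch referenced in the first item (top_db=30 is an assumption; librosa's default is 60):

```python
import librosa
import soundfile as sf

def trim_and_filter(in_path, out_path, top_db=30, max_duration=10.0):
    """Trim leading/trailing silence; drop clips longer than max_duration seconds."""
    audio, sr = librosa.load(in_path, sr=None)  # keep the file's native sample rate
    trimmed, _ = librosa.effects.trim(audio, top_db=top_db)
    if librosa.get_duration(y=trimmed, sr=sr) > max_duration:
        return False  # caller drops this entry from the filelist
    sf.write(out_path, trimmed, sr)
    return True
```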

Additional Questions

  • Does the above gate output make sense? Does this mean the model thinks all sounds end at around frame 350? (See the sketch after this list.)
  • What happens if I train the model with only one flow? Can I still do inference and/or style transfer, just with a lower score? Or does it become unusable due to the absence of the reverse mapping(?)
  • To my understanding, the process I should follow is: first, train with flow=1 until the attention aligns; second, the same with flow=2; and third, turn attn_prior off so the model learns to attend on its own. What's the sign to look for during the third step? How do I know the model has learned to attend?
  • In the default config, ctc_loss starts at 10k iters. Do I need to change this? Does starting it earlier or later affect anything?
  • Does warm starting from flowtron_ljs really help with learning a different language? I'm wondering which parts it helps with (the decoder?).
  • Regarding emotion transfer: if I get this Thai dataset working, can I use an English dataset to transfer emotion into it? Or do I also need a Thai dataset with emotion labels?
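
The gate sketch referenced in the first question: in Tacotron-style decoders the gate output is a per-frame stop probability, and inference stops at the first frame whose sigmoid crosses a threshold, so a curve crossing around frame 350 would mean the model predicts the utterance ends there. A minimal sketch (the 0.5 threshold is an assumption, not necessarily Flowtron's value):

```python
import torch

def predicted_end_frame(gate_logits, gate_threshold=0.5):
    """First frame whose stop probability exceeds the threshold, else None.

    gate_logits: raw (pre-sigmoid) gate outputs, shape (n_frames,).
    """
    probs = torch.sigmoid(gate_logits)
    hits = (probs > gate_threshold).nonzero(as_tuple=True)[0]
    return int(hits[0]) if len(hits) else None  # None: model never predicts a stop
```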

Thank you for reading; I would really appreciate any answers or suggestions.

Bahm9919 commented Oct 15, 2021

I think the problem is the representation of your symbols in the text. You should get good alignment within 10k steps; if you don't, it means something is wrong.

Yes, warm starting helps.

I don't know what's happening, but yes, you can do inference with only 1 flow.

SornrasakC (Author) commented

I guess I will try changing my symbols to ASCII; I will come back with an update soon.

SornrasakC (Author) commented

Sorry for a very late reply.

I tried changing my symbols to characters such as @A1 and @A2, along with new filelists already converted to the added symbols.
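
The conversion itself was just a lookup from each IPA symbol to a unique ASCII token; a minimal sketch (the table entries are made-up examples, not the real mapping):

```python
# Example entries only; the real table covers the full Thai phone inventory.
IPA_TO_ASCII = {"kʰ": "@A1", "tɕ": "@A2", "ŋ": "@A3"}

def to_ascii_tokens(ipa_text):
    """Replace IPA symbols with ASCII tokens, longest symbols first."""
    for ipa in sorted(IPA_TO_ASCII, key=len, reverse=True):
        ipa_text = ipa_text.replace(ipa, IPA_TO_ASCII[ipa])
    return ipa_text
```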

Here are the results after 20k iters, which honestly look no different... I wonder where it went wrong.

[image: attention weight plot after 20k iterations]

[image: additional training plot]

SornrasakC (Author) commented

Turns out, 20 of the ~2k audio files were just pure noise; removing them solved the problem.
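
For anyone hitting the same failure mode, one way to flag pure-noise clips automatically is spectral flatness: white noise sits near 1.0 while voiced speech sits well below it. A minimal sketch (the 0.4 threshold is a guess; verify flagged files by ear):

```python
import librosa
import numpy as np

def looks_like_noise(path, flatness_threshold=0.4):
    """Flag clips whose median spectral flatness suggests pure noise."""
    audio, _ = librosa.load(path, sr=None)
    flatness = librosa.feature.spectral_flatness(y=audio)  # shape (1, n_frames)
    return float(np.median(flatness)) > flatness_threshold
```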

[image: attention weight plot after removing the noisy files]

SornrasakC (Author) commented

@Bahm9919
Sorry for the @-mention and for commenting on a closed issue.
I saw that you were recently able to run inference; would you kindly share how you did it?

Especially:

  • Torch version
  • Any changes in inference.py
  • Which WaveGlow weights
  • Does your submodule have this commit? Submodule path 'tacotron2': checked out '6f435f7f29c3e1553cf2dd7ca2daf56903b20c39'

Or anything else that might be relevant.
Very much appreciated.

Bahm9919 commented

I see your issue; I will answer there.
