
Improved acoustic model support: introducing autoregressive structure #15

Closed · 3 of 4 tasks
r9y9 opened this issue Sep 7, 2020 · 4 comments · Fixed by #31 or #129
Labels: discussion, enhancement (New feature or request)

Comments

@r9y9 (Collaborator) commented Sep 7, 2020

As in the shallow autoregressive (AR) model proposed by Xin Wang.

This was originally part of #1, but I opened a separate issue since it is one of the most important action items for improving singing voice synthesis quality. Specific discussion and progress can be tracked in this thread. Any comments and suggestions are welcome.
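For context, here is a minimal sketch of what a shallow AR output layer could look like: the frame-wise output of the core network is refined by a trainable linear filter over the previously generated output frames, so temporal dependence is added at the output layer rather than inside a deep decoder. This is my own illustration of the idea, not Xin Wang's reference code or the eventual nnsvs implementation, and it omits details such as teacher forcing during training:

```python
import torch
import torch.nn as nn

class ShallowARLayer(nn.Module):
    """Illustrative shallow AR output layer (hypothetical, for this sketch).

    The output at frame t is h_t plus a trainable linear combination of
    the K previously generated output frames.
    """

    def __init__(self, out_dim: int, order: int = 2):
        super().__init__()
        self.order = order
        # One AR coefficient per lag and per output dimension.
        self.ar_coef = nn.Parameter(torch.zeros(order, out_dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, T, out_dim) frame-wise predictions from the core network.
        B, T, D = h.shape
        outputs = []
        prev = [h.new_zeros(B, D) for _ in range(self.order)]
        for t in range(T):
            o_t = h[:, t]
            for k in range(self.order):
                o_t = o_t + self.ar_coef[k] * prev[k]
            outputs.append(o_t)
            prev = [o_t] + prev[:-1]  # shift the history of generated frames
        return torch.stack(outputs, dim=1)  # (B, T, out_dim)
```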

@r9y9 r9y9 added enhancement New feature or request discussion labels Sep 7, 2020
@r9y9 r9y9 changed the title Ehance acoustic model: introducing autoregressive structure Improved acoustic model support: introducing autoregressive structure Sep 7, 2020
@taroushirani (Contributor) commented Sep 26, 2020

Hello, I am trying to change the acoustic model from Conv1dResnet to RMDN as a preliminary step towards shallow AR, and I am running into a shortage of GPU memory.

An MDN returns pi, sigma, and mu, so the total number of returned parameters is B × G × (2 × D_out + 1), where B is the batch size, G is the number of Gaussian components, and D_out is the dimensionality of the target variable. In my experience, the upper limit of G × (2 × D_out + 1) seems to be about 4000 when using a GPU with 8 GB of memory. Because the out_dim of the default acoustic model is 199, we can set the number of Gaussian components to at most 10 or so, which may not be enough to get good results.
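For illustration, here is a minimal sketch of an MDN output head that produces this parameter count; the class name and layer layout are assumptions for the example, not the actual nnsvs code:

```python
import torch
import torch.nn as nn

class MDNOutputLayer(nn.Module):
    """Illustrative MDN output head: per frame it emits G mixture weights
    plus G * D_out means and G * D_out (log) scales, i.e.
    G * (2 * D_out + 1) values in total."""

    def __init__(self, in_dim: int, out_dim: int, num_gaussians: int):
        super().__init__()
        self.out_dim = out_dim
        self.num_gaussians = num_gaussians
        self.pi = nn.Linear(in_dim, num_gaussians)                   # weights
        self.mu = nn.Linear(in_dim, num_gaussians * out_dim)         # means
        self.log_sigma = nn.Linear(in_dim, num_gaussians * out_dim)  # log scales

    def forward(self, x: torch.Tensor):
        # x: (B, T, in_dim) hidden features from the core network.
        B, T, _ = x.shape
        log_pi = torch.log_softmax(self.pi(x), dim=-1)  # (B, T, G)
        mu = self.mu(x).view(B, T, self.num_gaussians, self.out_dim)
        sigma = torch.exp(self.log_sigma(x)).view(B, T, self.num_gaussians, self.out_dim)
        return log_pi, sigma, mu

# With out_dim=199 and num_gaussians=10: 10 * (2 * 199 + 1) = 3990 values per frame.
```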

Is there any need to train f0, mgc, and bap separately, or is there a good method to reduce memory consumption?

@r9y9 (Collaborator, Author) commented Sep 27, 2020

As I mentioned in #20, the number of Gaussians generally wouldn't need to be large (more than 16), so I suppose using a mixture density network doesn't increase GPU memory usage dramatically. If we want to save GPU memory, we can use a smaller batch size, which should be okay in my experience. If the batch size matters, we can implement a gradient accumulation trick.
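A minimal sketch of the gradient accumulation trick in PyTorch (the tiny model and dummy data are placeholders, not the nnsvs training loop): several small batches are accumulated before each optimizer step, so the effective batch size stays large while peak GPU memory follows the small per-step batch:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 199)  # placeholder model for the sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()
loader = [(torch.randn(4, 10), torch.randn(4, 199)) for _ in range(8)]  # dummy data

accum_steps = 4
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = criterion(model(x), y) / accum_steps  # scale so gradients average
    loss.backward()                              # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                         # one update per accum_steps batches
        optimizer.zero_grad()
```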

> Is there any need to train f0, mgc, bap separately

If we want to try the shallow AR model, we would need to model F0 separately, at least. I didn't do that for simplicity, and my time was limited. Anyway, it's worth trying. If we implement a separate stream-wise training strategy, we can reduce GPU memory usage accordingly.
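For illustration, a sketch of how the 199-dim features could be split stream-wise; the slice boundaries assume a typical layout (60-dim mgc, 1-dim lf0, and 5-dim bap, each with delta and delta-delta features, plus 1-dim vuv) and should be checked against the actual feature config:

```python
import torch

streams = {
    "mgc": slice(0, 180),    # 60 x 3 (static + delta + delta-delta)
    "lf0": slice(180, 183),  # 1 x 3
    "vuv": slice(183, 184),  # 1 (no dynamic features)
    "bap": slice(184, 199),  # 5 x 3
}

def split_streams(y: torch.Tensor) -> dict:
    """Split (B, T, 199) acoustic features into per-stream targets."""
    return {name: y[..., idx] for name, idx in streams.items()}

y = torch.randn(2, 100, 199)
targets = split_streams(y)  # e.g. targets["lf0"].shape == (2, 100, 3)
# Each stream can then be modeled by its own smaller network, so only one
# stream's MDN parameters need to fit in GPU memory at a time.
```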

@taroushirani (Contributor) commented

Thank you for your comment. I'll try a smaller number of Gaussians and investigate the gradient accumulation trick.

@r9y9 (Collaborator, Author) commented Jul 3, 2022

I'm revisiting the autoregressive model. I managed to make my implementation work reasonably well, but it is not significantly better than the current models. See #129.

Pros

  • (Slightly) better temporal dynamics
  • (Slightly) less-smoothed outputs

Cons

  • Extremely slow training: it takes about a week for 100 epochs with Ritsu's database, though this depends on the settings.

@r9y9 r9y9 self-assigned this Jul 3, 2022
@r9y9 r9y9 closed this as completed in #129 Jul 3, 2022