Improved acoustic model support: introducing autoregressive structure #15
Hello, I'm trying to change the acoustic model from Conv1dResnet to RMDN as a preliminary step towards shallow AR, and I'm suffering from a shortage of GPU memory. The MDN returns pi, sigma, and mu, so the total number of returned parameters is B x G x (2 x D_out + 1) (B is the batch size, G is the number of Gaussian components, and D_out is the number of dimensions of the target variable). In my experience, the upper limit of the per-frame value G x (2 x D_out + 1) seems to be about 4000 when using a GPU with 8 GB of memory. Because the out_dim of the default acoustic model is 199, we can set the number of Gaussian components to at most 10 or so, but this may not be enough to get good results. Is there any need to train f0, mgc, and bap separately, or is there a good method to reduce memory consumption?
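For reference, the parameter count above works out as follows (a minimal sketch; `mdn_output_dim` is an illustrative helper, not part of the codebase):

```python
def mdn_output_dim(num_gaussians: int, out_dim: int) -> int:
    # Each Gaussian component contributes a mean and a (diagonal) sigma
    # per output dimension, plus one mixture weight (pi):
    # G * (2 * D_out + 1) values per frame.
    return num_gaussians * (2 * out_dim + 1)

# With the default acoustic feature dimension of 199:
print(mdn_output_dim(10, 199))  # 3990, near the ~4000 practical limit
print(mdn_output_dim(16, 199))  # 6384, already well past it
```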
As I mentioned in #20, the number of Gaussians wouldn't be a large value (>16) in general, so I suppose using a mixture density network doesn't increase GPU memory usage dramatically. If we want to save GPU memory, we can use a smaller batch size, and that has been okay in my experience. If the batch size matters, we can implement a gradient accumulation trick.
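The gradient accumulation trick mentioned above can be sketched with a generic PyTorch training loop (the `model`, `loader`, and `criterion` names are placeholders, not the project's actual training code):

```python
import torch

def train_epoch(model, loader, criterion, optimizer, accum_steps=4):
    """Accumulate gradients over `accum_steps` mini-batches so the
    effective batch size is accum_steps * batch_size while keeping the
    memory footprint of a single small mini-batch."""
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        loss = criterion(model(x), y)
        # Scale the loss so the accumulated gradient matches the
        # average over the large effective batch.
        (loss / accum_steps).backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```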
If we want to try the shallow-AR model, we would need to model F0 separately, at least. I didn't do that for simplicity, and my time was limited. Anyway, it's worth trying. If we implement a separate stream-wise training strategy, we can reduce GPU memory usage accordingly.
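Stream-wise training could start by slicing the acoustic feature matrix into its streams and fitting a separate model per stream. A minimal sketch (the stream layout and sizes below are illustrative, not the exact nnsvs configuration):

```python
import numpy as np

# Hypothetical stream layout along the feature axis: [mgc | lf0 | vuv | bap].
STREAM_SIZES = {"mgc": 180, "lf0": 1, "vuv": 1, "bap": 5}

def split_streams(features: np.ndarray) -> dict:
    """Split a (T, D) acoustic feature matrix into per-stream matrices."""
    streams, start = {}, 0
    for name, size in STREAM_SIZES.items():
        streams[name] = features[:, start:start + size]
        start += size
    return streams

feats = np.zeros((100, sum(STREAM_SIZES.values())))
parts = split_streams(feats)
print({k: v.shape for k, v in parts.items()})
# → {'mgc': (100, 180), 'lf0': (100, 1), 'vuv': (100, 1), 'bap': (100, 5)}
```

Each per-stream model (e.g. an autoregressive F0 model) then only needs the memory for its own output dimensionality.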
Thank you for your comment. I'll try a smaller number of Gaussians and investigate the gradient accumulation trick.
I'm revisiting the autoregressive model. I managed to make my implementation work reasonably well, but not significantly better than the current models. #129

Pros

Cons
As in the shallow AR model proposed by Xin Wang.
This issue was part of #1, but I raised a new issue since this is one of the most important action items for improving singing voice synthesis quality. Specific discussion and progress can be tracked in this thread. Any comments and suggestions are welcome.
MDN + AR