can we just reconstruct the waveform from fundamental frequency and loudness? #40
Another question I am really curious about: if we'd like to do human voice reconstruction from multiple sources (different people), should we consider timbre and include z in the model?
Hi, glad it's working for you. I'd be happy to hear an example reconstruction if you want to share. My guess is that the model is probably overfitting quite a lot to a small dataset. In that case, a given segment of loudness and f0 corresponds to a specific phoneme because the dataset doesn't have enough variation. For a large dataset, there will be one-to-many mappings that the model can't handle without more conditioning (latent or labels). We don't use the latent "z" variables in the models in the timbre_transfer and train_autoencoder colabs, but the encoders and decoders are in the codebase and used in models/nsynth_ae.gin as an example. My intuition is that the model should work well for TTS (the sinusoidal model it's based on is used in audio codecs, so we know it should be able to fit it); you just need to add grapheme or phoneme conditioning.
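For intuition, here is a minimal, self-contained sketch of the harmonic/sinusoidal model mentioned above (NumPy only, not the actual DDSP code; all names and numbers are made up for illustration). The point it shows is that the decoder only has to map each (f0, loudness) frame to a set of harmonic amplitudes; on a small dataset that mapping can effectively memorize which timbre goes with which frame, which is why reconstruction from f0 and loudness alone can sound so good.

```python
import numpy as np

def harmonic_synth(f0_hz, amplitude, harmonic_distribution, sample_rate=16000):
    """Toy harmonic synthesizer: a sum of sinusoids at integer multiples of f0.

    f0_hz:                 [n_frames] fundamental frequency per frame.
    amplitude:             [n_frames] overall loudness per frame.
    harmonic_distribution: [n_frames, n_harmonics] relative harmonic weights
                           (in the real model these are predicted by the
                           decoder from f0 and loudness; here they are given).
    """
    n_frames, n_harmonics = harmonic_distribution.shape
    hop = sample_rate // 100                      # 10 ms frames (assumed)
    n_samples = n_frames * hop

    # Upsample frame-rate controls to audio rate by simple repetition.
    f0 = np.repeat(f0_hz, hop)
    amp = np.repeat(amplitude, hop)
    harm = np.repeat(harmonic_distribution, hop, axis=0)

    # Instantaneous phase of the fundamental, then of each harmonic.
    phase = 2 * np.pi * np.cumsum(f0) / sample_rate            # [n_samples]
    k = np.arange(1, n_harmonics + 1)                          # harmonic numbers
    audio = (amp[:, None] * harm * np.sin(phase[:, None] * k)).sum(axis=1)
    return audio

# Same f0 and loudness, two different harmonic distributions -> two different
# timbres. The decoder's job is to pick the right distribution per frame.
n_frames, n_harmonics = 200, 20
f0 = np.full(n_frames, 220.0)
loud = np.full(n_frames, 0.1)
flat = np.ones((n_frames, n_harmonics)) / n_harmonics     # brighter timbre
low = np.zeros((n_frames, n_harmonics)); low[:, :3] = 1/3  # darker timbre
audio_bright = harmonic_synth(f0, loud, flat)
audio_dark = harmonic_synth(f0, loud, low)
```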
Thanks a lot for your reply!
Do you mean we can add conditioning besides z, f0, and loudness? You also mentioned I could add grapheme or phoneme conditioning for the TTS task; do you mean using an encoder to extract phoneme, grapheme, or other conditioning, concatenating it with z, f0, and loudness (do we even have f0 and loudness in a TTS task?), and then feeding that to the decoder?
There are a lot of options to try; we only have results based on our published work. If you want control over the output, you need to condition on variables that you know how to control. For instance, most TTS systems only use phonemes or text as conditioning and then let the network figure out what to do with them. You can try to figure out how to interpret z, but it is not trained to be interpretable as is.
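To make the concatenation idea from the question above concrete, here is a purely illustrative sketch; the names, shapes, and the assumption of frame-aligned phoneme labels are not part of the DDSP codebase. It just shows the shape bookkeeping of embedding a phoneme sequence and concatenating it with f0, loudness, and an optional z to form a decoder input.

```python
import tensorflow as tf

# Hypothetical sizes, not DDSP's API: assume phoneme labels have already been
# aligned to the same frame rate as f0 and loudness.
n_phonemes = 50          # size of the phoneme inventory (assumed)
phoneme_emb_dim = 64

phoneme_embedding = tf.keras.layers.Embedding(n_phonemes, phoneme_emb_dim)

def build_decoder_conditioning(f0_scaled, loudness_scaled, phoneme_ids, z=None):
    """Concatenate per-frame conditioning signals along the channel axis.

    f0_scaled:       [batch, n_frames, 1]
    loudness_scaled: [batch, n_frames, 1]
    phoneme_ids:     [batch, n_frames] int32 frame-aligned phoneme labels
    z:               optional [batch, n_frames, z_dim] latent
    """
    phonemes = phoneme_embedding(phoneme_ids)     # [batch, n_frames, emb_dim]
    features = [f0_scaled, loudness_scaled, phonemes]
    if z is not None:
        features.append(z)
    return tf.concat(features, axis=-1)           # decoder input

# Example with random inputs just to check the shapes.
batch, n_frames = 2, 1000
cond = build_decoder_conditioning(
    f0_scaled=tf.random.uniform([batch, n_frames, 1]),
    loudness_scaled=tf.random.uniform([batch, n_frames, 1]),
    phoneme_ids=tf.random.uniform([batch, n_frames], maxval=n_phonemes,
                                  dtype=tf.int32),
)
print(cond.shape)  # (2, 1000, 66)
```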
Thanks for your reply! For conditioning, do you mean the features after the encoder part? If we want more conditioning, do you mean we could try to use some network to encode phonemes or graphemes as conditioning? Should I try to make the conditioning similar for similar words? Is there a rule to follow (to find proper conditioning)?
The Tacotron papers (https://google.github.io/tacotron/) have extensively investigated different types of TTS conditioning. I suggest you check out some of their work.
Hey, I just got a really good reconstruction result that seems too good to be true. I have a sense that the idea behind the model is really good, but it is still amazing to me. I just used your demo autoencoder to reconstruct audio of the human voice and the result is really good. But I can't understand how this can be achieved using only f0 and loudness information. For example, the vowels 'a' and 'e' are definitely different; how is that reflected in f0 and loudness? I thought there might be some difference between musical instruments and the human voice. I just can't understand how these features are enough.
By the way, if I want to add z as a latent space besides f0 and loudness, how can I tell the model to use it? I thought you mentioned in the paper that z may correspond to timbre information, but I couldn't find it in timbre_transfer.ipynb. Can you achieve timbre transfer without z?
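To make the z pathway concrete (the reply above notes it is used in models/nsynth_ae.gin but not in the timbre_transfer colab), here is an illustrative stand-in rather than the DDSP implementation: spectral frames pass through a small RNN and a projection to a per-frame latent z, which would then be concatenated into the decoder conditioning just like the phoneme example above. All class names and sizes here are assumptions for the sketch.

```python
import tensorflow as tf

class ToyZEncoder(tf.keras.layers.Layer):
    """Illustrative per-frame latent encoder (not DDSP's): spectral frames -> z."""

    def __init__(self, z_dim=16, rnn_units=128):
        super().__init__()
        self.rnn = tf.keras.layers.GRU(rnn_units, return_sequences=True)
        self.project = tf.keras.layers.Dense(z_dim)

    def call(self, spectral_frames):
        # spectral_frames: [batch, n_frames, n_bins], e.g. log-mel or MFCC frames.
        hidden = self.rnn(spectral_frames)        # [batch, n_frames, rnn_units]
        return self.project(hidden)               # [batch, n_frames, z_dim]

# z varies with the spectral content (timbre) even when f0 and loudness are
# identical, which is what would let a multi-speaker model keep voices apart.
encoder = ToyZEncoder()
z = encoder(tf.random.uniform([2, 1000, 64]))     # shape (2, 1000, 16)
```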