Replication #46
Cool, thanks for your fast reply! Regarding point 3, could you tell me exactly which utterances you used?
We used all utterances from speaker 24.
@polvanrijn did you manage to transfer the 'surprised' style?
@rafaelvalle, thanks for your reply. @DamienToomey, yes, I just redid the analysis. See the gist here and listen to the output here. To be honest, I see no resemblance to the fragment on the NVIDIA demo page. @rafaelvalle, do you know what causes such big deviations between the pre-trained models?
Give us a few days to put a notebook up replicating some of our experiments.
Please take a look at https://github.com/NVIDIA/flowtron/blob/master/inference_style_transfer.ipynb |
Thanks for the notebook! 👍 I also noticed you use a newer WaveGlow model and removed the fixed seed. As you don't use a fixed seed, I could not listen to the exact samples you created in your notebook. What I noticed is that if you rerun the script multiple times, you get very different samples. See the gist here and listen to examples here. This is not only the case for the baseline, but also for the posterior sample. You can hear that the variation is huge. To me the posterior samples do not really resemble the audio clips the posterior was computed from, at least not as strongly as the sample on the NVIDIA demo page (but this is my perception and would need to be quantified).

Observing this variation, I wonder: to what extent do the z values capture meaningful properties of surprise if samples drawn from them deviate so much? If the variation in the sampling is this large, you will probably, at some point, also generate a sample that sounds surprised, but this does not mean that you sampled from z values computed on surprised clips. The same could happen if you sample long enough from a normal distribution. So my question is: why is there so much variation in the sampling?
Thank you for these comments and questions, Pol. You're right that if we sample long enough from a normal distribution we might end up getting samples that sound surprised. This happens when the z value comes from a region in z-space that we associate with surprised samples. Now, what if, instead of sampling from the entire normal distribution, we could sample only from the region that produces surprised samples? This is exactly what we can achieve with posterior sampling. Take a look at the image below for an illustration. Imagine the blue and red points are z values obtained by running Flowtron forward on existing human data. Imagine that the red samples were labelled as surprised and that the blue samples have other labels. Now, consider that the blue circles represent the pdf of a standard Gaussian with zero mean and unit variance. By randomly sampling from the standard Gaussian, we will only get samples associated with surprise with low probability. Fortunately, we can sample from that region directly by sampling from a Gaussian centered on it. We can obtain the mean parameter for this Gaussian by computing the centroid over the z values obtained from surprised samples. In addition to defining the mean, we need to define the variance of the Gaussian, which is the source of variation during sampling. As we increase the variance, we end up sampling from regions in z-space not associated with surprise. The red circles represent this Gaussian. Take a look at the samples here and here, in which we perform an interpolation over time between the random Gaussian and the posterior. For more details, take a look at the Transflow paper.
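To make the posterior sampling described above concrete, here is a minimal PyTorch sketch. It assumes the z values from the surprised clips have already been collected into one tensor; the names and shapes (`z_surprised`, 80 dims × 200 frames) are illustrative, not Flowtron's actual API.

```python
import torch

# Hypothetical stand-in: 10 surprised clips, each mapped by the forward
# pass to a z matrix of 80 dims x 200 frames (padded to equal length).
z_surprised = torch.randn(10, 80, 200)

# Posterior mean: centroid of the z values from the surprised clips.
mu = z_surprised.mean(dim=0)  # shape (80, 200)

# Decoupled variance: a free knob. Small sigma keeps samples near the
# "surprised" region; large sigma drifts back toward the standard prior.
sigma = 0.5

# Draw a new z from N(mu, sigma^2 I); running it through the inverse flow
# (the infer step) would synthesize a surprised-sounding mel-spectrogram.
z_new = mu + sigma * torch.randn_like(mu)
```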
After reading the Transflow paper, I was wondering why […]
Thank you for your detailed and clear explanation and the intuitive illustration, Rafael! Also thanks for the interesting paper. There are two things I still wonder about. First, Gambardella et al. (2020) propose to construct the posterior such that the mean and the standard deviation are computed analytically (p. 3). But aren't we solely estimating the mean, and not the standard deviation, in the current implementation? In the notebook you sent we use […]. My second question is about the […]
In the Transflow paper, lambda is a hyperparameter that controls both the mean and the variance as coupled parameters. This creates a paradox: if you want more variance, you have to decrease lambda, but by decreasing lambda you move away from the posterior mean, which makes samples from the region in z-space associated with the style less probable than others. We circumvent this by treating the mean and variance as decoupled parameters. As Pol mentioned, the variance has to be adjusted accordingly. Regarding choosing […]
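For intuition, a small numeric sketch of the coupled-versus-decoupled distinction. The coupled form below is the standard conjugate-Gaussian posterior, used only as an assumed stand-in consistent with the description above; the exact Transflow parameterization may differ.

```python
import torch

# Illustrative centroid of z values from the target style.
z_bar = torch.randn(80, 200)

# Coupled (assumed Transflow-like form): one knob, lambda, moves both
# statistics in opposite directions.
lam = 4.0
mu_coupled = (lam / (lam + 1.0)) * z_bar  # lower lambda -> mean shrinks toward 0
var_coupled = 1.0 / (lam + 1.0)           # lower lambda -> variance grows

# Decoupled (as described above): pin the mean at the centroid and treat
# the standard deviation as an independent knob.
mu_decoupled = z_bar
sigma = 0.7
z_sample = mu_decoupled + sigma * torch.randn_like(mu_decoupled)
```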
We're currently working on a model, Flowtron Parallel, that can also transfer non-textual characteristics that are hard to extract, such as breathiness, nasal voice, and whispering. Take a look at this example in which we perform voice conversion, i.e. we replace the speaker in vc_source.wav with LJSpeech's speaker […]
Thank you for your reply and for your approaches for dealing with not knowing
I understand what you are saying. I think I wasn't clear in my formulation. The averaging makes a lot of sense to me, but I wondered why we don't align the z-spaces (e.g. stretch them to be of the same size). Say for simplicity we compute an average of two Z matrices extracted from two sound files, Z1 and Z2 respectively. In the current implementation, we repeat the Z matrices and cut off the part longer than […]

Regarding direct prosody transfer, I was not precise. I did not mean to take an extracted Z matrix of a fragment and directly synthesise a new sentence with it, but rather to draw Z from a normal distribution and apply it to different sentences. This is what I did in this gist. I selected 16 sentences from the Harvard sentences, generated 100 random Z matrices, and synthesised sounds from them. From those 16 × 100 sounds, I computed some simple acoustic measures (e.g. duration, mean pitch, etc.). Then I computed a correlation matrix for each of those acoustic measures separately. Here are the average correlation coefficients (absolute correlations):
I expected the same Z matrix to lead to similar changes across sentences, but the average correlations are rather low for some acoustic measures. Thanks for mentioning Mellotron, it is on my to-do list to look at next. :-) What you describe about Flowtron Parallel looks very promising. The example sounds great. Can't wait until it's released.
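To illustrate the alignment question in the comment above, here is a hedged sketch contrasting repeat-and-crop averaging with the proposed stretch-then-average alignment. Shapes, names, and the cropping logic are illustrative assumptions, not Flowtron's actual code.

```python
import torch
import torch.nn.functional as F

z1 = torch.randn(80, 150)  # (n_dims, n_frames) from sound file 1
z2 = torch.randn(80, 230)  # (n_dims, n_frames) from sound file 2
n_frames = 200             # target length for synthesis

def stretch(z, length):
    # Linearly interpolate along the time axis; interpolate expects
    # a (batch, channels, time) tensor, hence the unsqueeze/squeeze.
    return F.interpolate(z.unsqueeze(0), size=length, mode="linear",
                         align_corners=False).squeeze(0)

# Repeat-and-crop (roughly the current behaviour): tile each Z matrix
# until it is long enough, then cut it at n_frames.
z1_crop = z1.repeat(1, n_frames // z1.size(1) + 1)[:, :n_frames]
z2_crop = z2.repeat(1, n_frames // z2.size(1) + 1)[:, :n_frames]
z_avg_crop = 0.5 * (z1_crop + z2_crop)

# Stretch-and-average (the proposal): time-align both matrices to the
# same length first, then average frame by frame.
z_avg_stretch = 0.5 * (stretch(z1, n_frames) + stretch(z2, n_frames))
```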
Thanks @rafaelvalle for making the code and the pre-trained models available. Also thanks to karkirowle for your code on style transfer in issue #9. I am preparing a presentation on Flowtron and wanted to replicate some of the audio files. Since the Sally model is not publicly available, I use the pre-trained LJS model. This also means that the results cannot be replicated exactly. During the replication, I came across the following questions/issues:
1. In `inference.py` […] and in karkirowle's gist it is set to 0.8. Which value should I use for replication?
2. In the `dogs` example the word 'door' is pronounced differently. In the `well_known` example breathing sounds suddenly occur, but more strikingly, "latent space" changes to "latence". @rafaelvalle, do you know why this is the case?
3. […] (`surprised_humans_transfer_without_time_avg.wav`). Applying time averaging (see the sketch after the screenshots below) makes the speech comprehensible, but does not really lead to a style transfer in my opinion. We can observe that the transferred speech has a longer duration than the baseline, but the spectrogram does not resemble the reference at all (e.g. compare the pitch excursions in the reference signal with the pitch contour in both the baseline and the transferred fragment in the screenshots below). @rafaelvalle, do you know why these results are so different using the LJS instead of the Sally model?

[Screenshots: "Surprised prior", "Baseline (random prior)", "Transferred"]
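A minimal sketch of the time averaging mentioned in point 3, assuming Z is averaged over the time axis and broadcast back over all frames; the names and shapes are illustrative, not the notebook's exact code.

```python
import torch

z = torch.randn(80, 200)                  # (n_dims, n_frames) from the reference clips

# Collapse the time axis to one value per dimension, then broadcast it
# back: all frame-to-frame detail is removed, which is consistent with
# the observation that speech becomes comprehensible but the transfer
# effect weakens.
z_time_avg = z.mean(dim=1, keepdim=True)  # (80, 1)
z_for_infer = z_time_avg.expand_as(z)     # (80, 200), constant over time
```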
Thanks in advance!
P.S.: For other users who are using the script: you always need to set the seed before each call to `model.infer()`, otherwise it will not use that seed!