
Replication #46

Closed
polvanrijn opened this issue Jul 14, 2020 · 15 comments

Comments

@polvanrijn

polvanrijn commented Jul 14, 2020

Thanks @rafaelvalle for making the code and the pre-trained models available. Also thanks to karkirowle for your code on style transfer in issue #9. I am preparing a presentation on Flowtron and wanted to replicate some of the audio files. Since the Sally model is not publicly available, I use the pre-trained LJS model. This also means that the results cannot be replicated exactly. During the replication, I came across the following questions/issues:

  • In the paper it says "During inference we used sigma = 0.7" (p. 4); however, in inference.py and in karkirowle's gist it is set to 0.8. Which value should I use for replication?
  • In total I performed three experiments (listen to them here and view the code here)
    • Experiment 1: modifying speech variation with sigma worked well.
    • Experiment 2: I noticed that leaving out the final '.' (dot) at the end of the sentence drastically changes the prosody and sometimes even the words. For example, in the dogs example the word 'door' is pronounced differently. In the well_known example breathing sounds suddenly appear, but more striking is that "latent space" changes to "latence". @rafaelvalle, do you know why this is the case?
    • Experiment 3: replicates transferring the 'surprised' style of a speaker (I used the same fragment as presented on the demo page; by the way, it is not speaker 03 as stated on the demo page: speaker 03 is male, while the emotional prior on the demo site is a female speaker). Is it true that you only used a single audio file as a prior? I tried to replicate it both with and without time averaging (what I mean by time averaging is sketched right after this list). Without time averaging, the produced speech is fuzzy (see surprised_humans_transfer_without_time_avg.wav). Applying time averaging makes the speech comprehensible, but does not really lead to a style transfer in my opinion. We can observe that the transferred speech has a longer duration than the baseline, but the spectrogram does not resemble the reference at all (e.g. compare the pitch excursions in the reference signal with the pitch contour in both the baseline and the transferred fragment in the screenshots below). @rafaelvalle, do you know why these results are so different when using the LJS instead of the Sally model?
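
To be explicit about what I mean by time averaging, here is a minimal sketch with my own variable names; `z` stands in for the latent tensor obtained from the Flowtron forward pass on the reference clip:

```python
import torch

# placeholder for the (1, 80, n_frames_reference) latent from the forward pass
z = torch.randn(1, 80, 153)

# without time averaging: use the frame-wise z directly (fuzzy output in my runs)
z_no_avg = z

# with time averaging: collapse the time axis and broadcast the mean
# over the number of frames requested for synthesis
n_frames = 300
z_avg = z.mean(dim=2, keepdim=True).expand(-1, -1, n_frames).contiguous()
```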

Surprised prior
[screenshot: spectrogram of the surprised reference]

Baseline (random prior)
[screenshot: spectrogram of the baseline sample]

Transferred
[screenshot: spectrogram of the transferred sample]

Thanks in advance!

P.S.: For other users using the script: you always need to set the seed again before each call to model.infer(), otherwise it will not use that seed!
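
A minimal sketch of what I mean (assuming PyTorch; `model`, `residual`, `speaker_vecs` and `text` are set up as in the repo's inference.py):

```python
import torch

def infer_with_seed(model, residual, speaker_vecs, text, seed=1234):
    # re-seed immediately before EVERY infer call; otherwise only the first
    # call after setting the seed is reproducible
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
    return model.infer(residual, speaker_vecs, text)
```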

@rafaelvalle
Contributor

rafaelvalle commented Jul 14, 2020

  1. It is an issue with the attention mechanism. This will not happen with Flowtron Parallel, which we'll release soon.
    https://twitter.com/RafaelValleArt/status/1281268833504751616?s=20
  2. We used multiple files from the same speaker and style as evidence. Try it this way and let us know.

@polvanrijn
Author

Cool, thanks for your fast reply! Regarding 3., could you tell me exactly which utterances you used?

@rafaelvalle
Contributor

We used all utterances from speaker 24.

@DamienToomey

DamienToomey commented Jul 29, 2020

@polvanrijn did you manage to transfer the 'surprised' style?

@polvanrijn
Author

@rafaelvalle, thanks for your reply. @DamienToomey, yes, I just redid the analysis. See the gist here and listen to the output here. To be honest, I see no resemblance to the fragment on the NVIDIA demo page. @rafaelvalle, do you know what causes such big deviations between the pre-trained models?

@rafaelvalle
Contributor

Give us a few days to put a notebook up replicating some of our experiments.

@rafaelvalle
Contributor

@polvanrijn
Author

Thanks for the notebook! 👍 I also noticed you use a newer waveglow model and removed the fixed seed. As you don't use a fixed seed, I could not listen to the exact samples you created in your notebook. What I noticed is that if you rerun your script multiple times, you get very different samples. See the gist here and listen to examples here. This is not only the case for the baseline, but also for the posterior sample. You can hear that the variation is huge. To me the posterior samples do not really resemble the audio clips the posterior was computed from, at least not as strongly as the sample on the NVIDIA demo page (but this is my perception and would need to be quantified). Observing this variation, I wonder: to what extent do the z-values capture meaningful properties of surprise if samples drawn from them deviate so strongly? If the variation in the sampling is so large, you will at some point probably also generate a sample that sounds surprised, but this does not mean that you sampled from z values computed on surprised clips. This could also happen if you sample long enough from a normal distribution. So my question is: why is there so much variation in the sampling?

@rafaelvalle
Contributor

rafaelvalle commented Aug 11, 2020

Thank you for these comments and questions, Pol.

You're right that if we sample long enough from a normal distribution we might end up getting sounds that sound surprised. This happens when the z value comes from a region in z-space that we associate with surprised samples. Now, imagine that instead of sampling from the entire normal distribution, we could sample only from the region that produces surprised samples. This is exactly what we can achieve with posterior sampling.

Take a look at the image below for an illustration. Imagine the blue and red points are z values obtained by running Flowtron forward on existing human data. Imagine that the red samples were labelled as surprised and that the blue samples have other labels. Now, consider that the blue circles represent the pdf of a standard Gaussian with zero mean and unit variance.

By randomly sampling the standard Gaussian, we will get samples associated with surprise only with low probability. Fortunately, we can sample from that region directly by sampling from a Gaussian centered on it. We can obtain the mean parameter for this Gaussian by computing the centroid over z values obtained from surprised samples. In addition to defining the mean, we need to define the variance of the Gaussian, which is the source of variation during sampling. As we increase the variance of this Gaussian, we end up sampling from regions in z-space not associated with surprise. The red circles represent this Gaussian.

[illustration: labelled z values with the standard Gaussian (blue) and the posterior Gaussian (red)]
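
To make this concrete, here is a toy 2-D sketch of the same idea (pure illustration, not Flowtron code):

```python
import torch

# pretend these are z values obtained by running the model forward on labelled data
z_surprised = torch.randn(8, 2) * 0.3 + torch.tensor([1.5, -0.5])  # red points
z_other = torch.randn(200, 2)                                      # blue points, roughly N(0, I)

# the centroid of the surprised z values becomes the mean of the sampling Gaussian
mu_posterior = z_surprised.mean(dim=0)

# small sigma keeps samples near the surprised region;
# large sigma drifts back into the rest of z-space
for sigma in (0.3, 1.0):
    samples = torch.distributions.Normal(mu_posterior, sigma).sample((5,))
    print(sigma, samples)
```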

Take a look at the samples here and here, in which we perform an interpolation over time between the random Gaussian and the posterior.

For more details, take a look at the Transflow paper.

@DamienToomey

After reading the Transflow paper, I was wondering why sigma in dist = Normal(mu_posterior.cpu(), sigma) (https://github.com/NVIDIA/flowtron/blob/master/inference_style_transfer.ipynb) is not divided by (ratio + 1) as in equation 8 of the Transflow paper, whereas mu_posterior seems to be computed following equation 7 of the Transflow paper.

@polvanrijn
Author

Thank you for your detailed and clear explanation and the intuitive illustration, Rafael! Also thanks for the interesting paper. There are two things I still wonder about. First, Gambardella et al. (2020) propose to construct the posterior with the mean and the standard deviation both computed analytically (p. 3). But aren't we solely estimating the mean, and not the standard deviation, in the current implementation? In the notebook you sent we use a sigma of 1, while in the Flowtron paper you propose a sigma of 0.7. By the way, the samples generated with sigma 0.7 sound much more surprised than the ones sampled with sigma 1. It seems to me that estimating sigma is a necessary step for generating realistic fragments: if it's too small you undershoot, if it's too big you overshoot and end up with fragments that do not sound surprised at all.

My second question is about the n_frames variable. I played around with it and found that if you make it too small, e.g. 50, you end up with chopped-off sounds (listen to samples here). We also know the z-values have a different 'length' (i.e. n_frames) for each of the stimuli, which is why we extend or cut the z-values to the same dimensionality (namely 80 times n_frames), roughly as in the sketch below. Now I wonder: how do you know the 'right' or even optimal value of n_frames? This question does not only need to be addressed when you draw new samples from a distribution like I did, but also if you want to sample from a space that, for example, resembles surprise. To do so, you need to force all the varied-length measured z-values into one fixed-size matrix whose size is determined by n_frames.
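
This is the extend-or-cut step I mean (a sketch with my own names; each extracted z is assumed to have shape 80 X n_frames_i):

```python
import torch

def fit_to_n_frames(z, n_frames):
    """Tile z along the time axis and crop so it ends up as 80 x n_frames."""
    repeats = -(-n_frames // z.shape[1])       # ceiling division
    return z.repeat(1, repeats)[:, :n_frames]

z_short = torch.randn(80, 121)
z_fixed = fit_to_n_frames(z_short, 300)        # shape (80, 300)
```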

@rafaelvalle
Contributor

rafaelvalle commented Aug 14, 2020

In the Transflow paper, lambda is a hyperparameter that controls both the mean and the variance as coupled parameters.
Relatively speaking, larger lambda values will produce a distribution with origin closer to the sample mean and smaller variance, while smaller lambda values will be closer to the zero mean and have larger variance.

This creates a dilemma: if you want more variance, you have to decrease lambda, but by decreasing lambda you move away from the posterior mean, which makes samples from the region in z-space associated with the style less probable than others. We circumvent this by treating the mean and variance as decoupled parameters. As Pol mentioned, the variance then has to be adjusted accordingly.
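
In code, the decoupled version looks roughly like this (a sketch; sigma is now a free knob chosen by hand, and the ratio-based weighting of the mean used in the notebook is left out for brevity):

```python
import torch
from torch.distributions import Normal

# z values computed from the style evidence, stacked as (n_clips, 80, n_frames)
z_evidence = torch.randn(8, 80, 300)

mu_posterior = z_evidence.mean(dim=0)   # mean pulled toward the style region
sigma = 0.7                             # decoupled from the mean; tune by ear

dist = Normal(mu_posterior, sigma)
residual = dist.sample().unsqueeze(0)   # 1 x 80 x n_frames residual for inference
```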

Regarding the choice of n_frames, it should be large enough that the model has enough frames to produce the text input. Given that 1 second of audio corresponds to approximately 86 frames, trying to generate the sentence "Humans are walking on the street" with 40 frames amounts to expecting the model to fit the sentence into half a second, which is very unlikely.

@polvanrijn
Author

Thank you for answering the questions. ☺️

You can also see very nicely that n_frames reaches a threshold at around 210 frames (see the animation below, where z is simply a growing 80 X n_frames matrix filled with zeroes). In other words, 210 frames are enough to produce the text "Humans are walking on the streets".
[animation: output Mel spectrogram as n_frames grows]

If you now take a sliding vertical stride, you can see that changes to frames < 94 do not change the output Mel spectrogram, but changes to later frames (≥ 94) do change the spectrogram:
[animation: effect of perturbing different frame ranges of z]

The duration of the created file is 2390 ms; 2.39 X 86 is about 206 frames, which confirms the minimal length of z for this sentence (300 - 94 = 206). What I now wonder is how you can estimate a suitable number for n_frames without first generating an audio file and looking at its duration. As you show in your paper (figure 3), different values in z lead to different durations (and hence require more n_frames to be produced properly). n_frames, so to speak, puts an upper bound on the fragment length. An extreme case is to fill the whole matrix with a high enough number; here I fill the whole 80 X 300 matrix with 2's. The duration then becomes 3480 ms, which is substantially longer than the baseline (2390 ms, 80 X 300 matrix filled with zeroes).
[screenshot: output for the 80 X 300 matrix filled with 2's]

A question regarding the size of the z-space is how to directly translate prosody across utterances. The emphasis so far has been on creating audio that sounds similar to a set of other sounds, but not on exactly translating prosody:

In the notebook you created, you compute a mean z-space over all utterances (in this case over 8 surprised sounds). As addressed in the notebook, the z-values are of different dimensionality, ranging from 80 X 121 to 80 X 173. We duplicate each matrix 2 or 3 times and cut it to exactly the same dimension, 80 X 300. We compute a mean over all 8 'aligned' z-matrices (and multiply and divide by the ratio) and finally draw from a normal distribution with a sigma that we need to set manually. If sigma is set properly, we do get outputs that sound like the input, e.g., we can observe an increased pitch range. What I still find quite puzzling is that we do not (need to?) take care of the alignment of the different extracted z-matrices; we just duplicate them and chop off the rest. This leaves me with two questions.

My first question is: how do you deal with different lengths of text that need differently sized 80 X n_frames matrices? Say text A only needs a length of 150 and text B needs a length of 200, and we set n_frames to 300. The same posterior can then have a very different effect, right?

Related to the previous question: if different texts need a different number of n_frames, different parts of the same posterior might be used, which would lead to the same prosodic effect being applied at different parts of the sentence. I wonder whether the same prosodic effect at different parts of the sentence might have very different perceptual effects. For example, pitch fluctuation might be prototypical for surprise, but possibly more salient at specific parts of the sentence (e.g. the end of the sentence).

To avoid this, it would be interesting to directly translate the prosody from one sentence to another. I am currently working on a translation example where both texts produce a fragment of equal duration. However, I would not know how to apply such a direct translation if the texts produce fragments with different durations.

@rafaelvalle
Contributor

how to choose n_frames for different text lengths?
Flowtron has a gating mechanism that will remove extra frames. There are several approaches to dealing with not knowing n_frames in advance:
1) train a simple model that predicts n_frames given some text and a speaker, e.g. different speakers have different speech rates. The simplest model is to take the average n_frames given text and speaker (see the sketch below).
2) choose n_frames such that it maxes out the GPU memory and rely on the gate to remove the extra frames.
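
A sketch of option 1 (illustrative only; the word-count feature and the fallback of roughly 86 frames per second are simplifications):

```python
from collections import defaultdict

def build_n_frames_table(examples):
    # examples: iterable of (text, speaker_id, n_frames) triples from the training set
    sums = defaultdict(lambda: [0, 0])
    for text, speaker, n_frames in examples:
        key = (speaker, len(text.split()))          # crude feature: speaker + word count
        sums[key][0] += n_frames
        sums[key][1] += 1
    return {k: total / count for k, (total, count) in sums.items()}

def predict_n_frames(table, text, speaker, frames_per_second=86, fallback_seconds=4.0):
    key = (speaker, len(text.split()))
    return int(table.get(key, frames_per_second * fallback_seconds))
```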

What I still find quite puzzling is that we do not (need to?) take care of the alignment of the different extracted z-matrices
If we compute a flowtron forward pass to obtain Z on a single sample and do not average it over time, this Z is sentence dependent and each frame is highly associated with the sentence. Now, if we compute Zs on a large number of samples and average over batch, we're averaging out sentence dependent characteristics. With this, we're keeping only characteristics that are common to all sentences and frames.

It would be interesting to directly translate the prosody from one sentence to another.
Our first approach was to transfer rhythm and pitch contour from one sample to a speaker.
Take a look at Mellotron and the samples on the website.
Mellotron only takes into account non-textual characteristics that are easy to extract, like token duration and pitch.

We're currently working on a model, Flowtron Parallel, that is also able to convert non-textual characteristics that are hard to extract, like breathiness, nasal voice, and whispering. Take a look at this example in which we perform voice conversion, i.e. we replace the speaker in vc_source.wav with LJSpeech's speaker (ftp_vc_zsourcespeaker_ljs.wav), while keeping the pitch contour, token durations, breathiness, and somber voice from the source.
ftp_vc.zip

@polvanrijn
Author

Thank you for your reply and for your approaches to dealing with not knowing n_frames in advance.

If we compute a flowtron forward pass to obtain Z on a single sample and do not average it over time, this Z is sentence dependent and each frame is highly associated with the sentence. Now, if we compute Zs on a large number of samples and average over batch, we're averaging out sentence dependent characteristics. With this, we're keeping only characteristics that are common to all sentences and frames.

I understand what you are saying. I think I wasn't clear in my formulation. The averaging makes a lot of sense to me, but I wondered why we don't align the z-spaces (e.g. stretch them to be of the same size). Say, for simplicity, we compute an average over two Z matrices extracted from two sound files, Z1 and Z2. In the current implementation, we repeat the Z matrices and cut off the part longer than n_frames. Then, for each point in all matrices (in this example only Z1 and Z2), we compute a mean. Since the extracted Z matrices are not of the same size, there will be cases where we compare the start of one sentence with the end of another sentence (see the red area in the figure below). I wonder if such a comparison is meaningful.

[figure: repeated Z1 and Z2 with the misaligned region marked in red]
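
What I had in mind with 'stretching' is roughly the following (a sketch: linear interpolation along the time axis instead of tile-and-crop, so that frame k of every utterance refers to the same relative position in its sentence):

```python
import torch
import torch.nn.functional as F

def stretch_to_n_frames(z, n_frames):
    # z: (80, n_frames_i); interpolate along time to a common length
    return F.interpolate(z.unsqueeze(0), size=n_frames, mode="linear",
                         align_corners=False).squeeze(0)

z1 = torch.randn(80, 121)
z2 = torch.randn(80, 173)
z_mean = torch.stack([stretch_to_n_frames(z, 300) for z in (z1, z2)]).mean(dim=0)
```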

Regarding direct prosody transfer, I was not precise. I did not mean taking an extracted Z matrix of a fragment and directly synthesising a new sentence with it, but rather drawing Z from a normal distribution and applying it to different sentences. This is what I did in this gist. I selected 16 sentences from the Harvard sentences, generated 100 random Z matrices, and synthesised sounds from them. From those 16 X 100 sounds, I computed some simple acoustic measures (e.g. duration, mean pitch, etc.). Then I computed a correlation matrix for each of those acoustic measures separately. Here are the average correlation coefficients (absolute correlations):

duration: 0.20
mean_pitch: 0.44
sd_pitch: 0.21
min_pitch: 0.21
max_pitch: 0.18
range_pitch: 0.17
slope_pitch: 0.20
mean_intensity: 0.24
e_0_500: 0.34
e_0_1000: 0.34
e_500_1000: 0.34
e_1000_2000: 0.35
e_0_2000: 0.35
e_2000_5000: 0.35
e_5000_8000: 0.24

I expected that the same Z matrix would lead to similar changes across sentences, but the average correlations are rather low for some acoustic measures.
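
For completeness, the numbers above were computed roughly like this per acoustic measure (a sketch; the random array stands in for the actual measurements):

```python
import numpy as np

# measure[s, z]: e.g. mean pitch of the sound synthesised from Z matrix z
# applied to sentence s -> shape (16 sentences, 100 Z matrices)
measure = np.random.rand(16, 100)

# correlate sentences with each other across the 100 shared Z matrices,
# then average the absolute off-diagonal coefficients
corr = np.corrcoef(measure)                   # (16, 16)
off_diag = ~np.eye(corr.shape[0], dtype=bool)
mean_abs_corr = np.abs(corr[off_diag]).mean()
```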

Thanks for mentioning Mellotron, it is next on my todo list. :-) What you describe about Flowtron Parallel looks very promising. The example sounds great. Can't wait until it's released.
