Train a better Speaker Encoder #512

Closed
6 tasks done
erogol opened this issue Sep 11, 2020 · 79 comments
Labels
discussion · help wanted (Extra attention is needed) · improvement (a new feature)

Comments

@erogol
Contributor

erogol commented Sep 11, 2020

Our current speaker encoder is trained with only the LibriTTS (100, 360) datasets. However, we can improve its performance using other available datasets (VoxCeleb, LibriTTS-500, Common Voice, etc.). This would also increase the performance of our multi-speaker model and make it easier to adapt to new voices.

I can't really work on this alone due to the recent changes and the amount of work needed, so I need a hand here to work on it together.

So I list the TODO as follows; feel free to contribute to any part of it or suggest changes:

  • decide target datasets
  • download and preprocess the datasets
  • write preprocessors for the new datasets (see the sketch at the end of this post)
  • increase the efficiency of the speaker encoder data loader
  • train a model using only English datasets
  • train a model with all the available datasets
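For the preprocessor item, here is roughly the shape a speaker-encoder preprocessor could take (the function name, folder layout and return format below are only illustrative, not the repo's actual API):

import glob
import os

def speaker_items(root_path):
    """Walk a dataset laid out as <root>/<speaker_id>/<utterance>.wav and return
    (wav_path, speaker_id) pairs; real preprocessors return richer metadata."""
    items = []
    for wav_path in glob.glob(os.path.join(root_path, "*", "*.wav")):
        speaker_id = os.path.basename(os.path.dirname(wav_path))
        items.append((wav_path, speaker_id))
    return items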
@erogol erogol added the improvement (a new feature), help wanted (Extra attention is needed) and discussion labels on Sep 11, 2020
@erogol erogol pinned this issue Sep 11, 2020
@mueller91
Contributor

Hi Erogol,
I'm up for it.

  • I've already downloaded VoxCeleb1+2, Mozilla Common Voice EN (latest version), LibriTTS, LibriSpeech Train-Other-500 and VCTK, and written the corresponding preprocessors. When selecting only speakers with >= 10 utterances, this yields >15k individual speakers.
  • I've added caching for the datasets' MFCC computation, so it is not redone every time (a sketch of the idea is at the end of this comment).
  • I've also started training, but due to hardware constraints (a very slow SSD whose seek time incurs a 400% overhead, see here) I cannot train efficiently. Can you provide access to better hardware?

My code is not based on the TTS repo, but I'll try to integrate it and submit a PR in the upcoming days.
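The caching idea, roughly (a simplified sketch, not the exact code in my fork; the cache directory and feature parameters are placeholders):

import hashlib
import os

import librosa
import numpy as np

CACHE_DIR = "/tmp/mfcc_cache"  # placeholder location

def cached_mfcc(wav_path, sr=16000, n_mfcc=40):
    """Compute the MFCCs for a file once and reload them from disk afterwards."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    key = hashlib.md5(wav_path.encode()).hexdigest()
    cache_path = os.path.join(CACHE_DIR, key + ".npy")
    if os.path.exists(cache_path):
        return np.load(cache_path)
    wav, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=n_mfcc)
    np.save(cache_path, mfcc)
    return mfcc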

@erogol
Contributor Author

erogol commented Sep 16, 2020

@mueller91 that is great. I can train the model if you send me a script that does all the processing. I am not sure if there is enough space to store all the datasets, but I can try. BTW, I also don't have an SSD that fits all the datasets.

I think the latency is normal, since each batch loads a lot of data; I don't think computing the specs on the fly is the cause of the problem.

One option is to keep a number of batches in memory and sample from them to fill half of the next batch, loading the rest from disk. I think that would reduce the requirements quite a lot. Does that make sense?

@mueller91
Contributor

Hi @erogol ,
okay, I'll integrate my solution in mozilla_TTS and report back.

To minimize the data loaded from disk, your suggestion makes sense; but for all utterances we'd reuse, we'd have identical pairs in the GE2E loss matrix as in the batch before. Not sure if that's desirable ...

I was thinking of re-using a given batch two or three times and just selecting a new random 1.6 s section of the MFCC for each utterance. What do you think?

@erogol
Contributor Author

erogol commented Sep 16, 2020

To minimize the data loaded from disk, your suggestion makes sense; but for all utterances we'd reuse, we'd have identical pairs in the GE2E loss matrix as in the batch before. Not sure if that's desirable ...

Let's assume our batch size is B. If we keep N batches in memory, replace the oldest batch with the next one, sample B/2 instances from the in-memory samples and load the rest from disk, it is very likely that every batch is different from the others. That is more important than having the same pairs appear in a batch a couple of times, since the average gradient would still be different. What do you think?

I was thinking to re-use a given batch two or three times, and just select a new random 1.6s section of the MFCC for each utterance. What do you think?

This also sounds like a good idea. Maybe we can combine these two ideas.

We can also add identical random noise to each speaker in the batch, so even if we use the same speaker from the cache, the model sees a slightly different version of that speaker's voice.
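Roughly what I have in mind, as a sketch only (class name, cache size, noise level and the disk-loading callback are all made up, and this ignores the per-speaker grouping the GE2E loss needs):

import random
from collections import deque

import numpy as np

class CachedSampler:
    """Sketch of the caching idea: keep the most recent utterances in RAM, build half of
    every new batch from that cache (with a fresh random 1.6 s crop and a little noise)
    and load the other half from disk."""

    def __init__(self, load_from_disk, n_cached_batches=5, batch_size=64, seq_len=25600):
        self.load_from_disk = load_from_disk          # callable: n -> list of 1-D float waveforms
        self.cache = deque(maxlen=n_cached_batches * batch_size)
        self.batch_size = batch_size
        self.seq_len = seq_len

    def _random_crop(self, wav):
        if len(wav) <= self.seq_len:
            return np.pad(wav, (0, self.seq_len - len(wav)))
        start = random.randint(0, len(wav) - self.seq_len)
        return wav[start:start + self.seq_len]

    def next_batch(self):
        half = self.batch_size // 2
        from_cache = random.sample(list(self.cache), k=min(half, len(self.cache)))
        from_disk = self.load_from_disk(self.batch_size - len(from_cache))
        self.cache.extend(from_disk)                  # newest items push the oldest ones out
        batch = [self._random_crop(w) + 1e-3 * np.random.randn(self.seq_len)
                 for w in from_cache + from_disk]
        return np.stack(batch)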

@mueller91
Contributor

mueller91 commented Sep 16, 2020

Sounds good, I'll implement those three ideas.

Also, I've just added the LibriTTS, VoxCeleb1+2 and Common Voice datasets:
this yields 25k speakers (without skipping those with <10 utterances).

 > DataLoader initialization
 | > Number of instances : 2072676
 | > Sequence length: 25600
 | > Num speakers: 25514

Finally: is there a reason do_trim_silence is set to false by default? My intuition is that removing silence gives the SE more 'information' to work with.

@erogol
Contributor Author

erogol commented Sep 16, 2020

I just assumed that the datasets are preprocessed, and I like to keep a bit of silence so the model is robust against it. But it might be set differently for different cases.

@mueller91
Contributor

I implemented the three improvements we discussed above.
The I/O overhead is reduced from about 3-5 seconds per batch to about 0-1 seconds; I can now train about 2000 steps in about 80 minutes (this is with 768 hidden neurons and 64 speakers per batch, as in the original paper).

Attached are the first training curves. You can find the source in my fork. I'll keep training this for a bit and then publish the model + config. Let me know if you are interested in a different set of hyperparameters.

@erogol
Contributor Author

erogol commented Sep 17, 2020

It is great!! Looks like the loss is smoothly going down.

How many samples do you have in total for this model? Have you made any particular changes to the model?

I was planning to remove the last ReLU layer, which, in my opinion, skews the output distribution. Also, with all these datasets, we could train a larger model.

You could also use AngleProtoLoss, with which @Edresson reported better results.

Are you planning to share the model and the code at the end? If you are, then I can work more on the universal vocoder, and @Edresson is working on the multi-speaker TTS model. After we merge all of these, we would have the best model possible.

@Edresson
Contributor

@mueller91 this is very good, congratulations :).

As @erogol commented, I got better results with Angular Prototypical (AngleProtoLoss) in my training. I recommend you try it :). The paper "In defence of metric learning for speaker recognition" also shows the superiority of Angular Prototypical.
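For reference, the core of the angular prototypical idea is quite small; a minimal sketch (not the exact TTS implementation, and the w/b initial values are just the common defaults from that paper's reference code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class AngularProtoLoss(nn.Module):
    """Sketch of angular prototypical loss; needs >= 2 utterances per speaker."""

    def __init__(self, init_w=10.0, init_b=-5.0):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(init_w))
        self.b = nn.Parameter(torch.tensor(init_b))

    def forward(self, embeddings):
        # embeddings: (num_speakers, num_utterances_per_speaker, embedding_dim)
        query = embeddings[:, -1, :]                    # one utterance per speaker as the query
        centroids = embeddings[:, :-1, :].mean(dim=1)   # centroid of the remaining utterances
        # cosine similarity between every query and every speaker centroid -> (N, N)
        cos = F.cosine_similarity(query.unsqueeze(1), centroids.unsqueeze(0), dim=2)
        logits = self.w.clamp(min=1e-6) * cos + self.b
        labels = torch.arange(embeddings.size(0), device=embeddings.device)
        return F.cross_entropy(logits, labels)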

@mueller91
Contributor

mueller91 commented Sep 17, 2020

With the datasets mentioned above, I have 25.5k speakers and 2M utterances.
I'm familiar with Angular Prototypical and have enabled it in the current model. I also enabled trimming silences (since a lot of the datasets are not preprocessed) and am using LSTMWithProjection, which has a linear output, not ReLU; I agree that ReLU skews the output. Maybe sigmoid would also be appropriate...

You can see my config here.

I've submitted a PR, and will be happy to share the model once trained.

@Edresson
Contributor

@mueller91 Could you train the model with the audio settings from this config here?

This would allow us to use the model to calculate a loss in the TTS model and generate speakers with voices closest to the originals :)

@mueller91
Contributor

mueller91 commented Sep 18, 2020

@Edresson Are you planning on using the speaker encoder to create an additional similarity loss term for the multi-speaker Tacotron? I tried that for a while with my own implementation; it didn't improve anything, but my speaker encoder was bad back then.
Google, in their original multi-speaker TTS paper, say they don't do that either, but there is another paper where the authors say it helped, so who knows. I'll give it a try with your parameters.

Most of the datasets are 16 kHz, so upsampling to 22050 Hz may slow the data loader down; I'll have to see how it turns out. Upsampling should not affect the MFCCs in a negative way, right?
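For reference, what I had tried looked roughly like this (just a sketch of the idea, not anyone's actual implementation; it assumes the speaker encoder can consume the TTS mel frames directly, which is exactly what the audio config question is about):

import torch.nn.functional as F

def speaker_consistency_loss(speaker_encoder, generated_mels, reference_embeddings):
    """Cosine-distance term between the embedding of the synthesized mels and the
    target speaker embedding."""
    generated_embeddings = speaker_encoder(generated_mels)   # (batch, embedding_dim)
    return 1.0 - F.cosine_similarity(generated_embeddings, reference_embeddings, dim=1).mean()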

@erogol
Contributor Author

erogol commented Sep 18, 2020

I am not sure, but the sampling rate of the speaker encoder should not make an important difference. In the end, the TTS model would learn what it needs to learn from the embedding regardless of the encoder's rate. But maybe I am wrong.

@Edresson
Contributor

@Edresson Are you planning on using the speaker encoder to create an additional similarity loss term for the multi-speaker Tacotron? I tried that for a while with my own implementation; it didn't improve anything, but my speaker encoder was bad back then.
Google, in their original multi-speaker TTS paper, say they don't do that either, but there is another paper where the authors say it helped, so who knows. I'll give it a try with your parameters.

Most of the datasets are 16 kHz, so upsampling to 22050 Hz may slow the data loader down; I'll have to see how it turns out. Upsampling should not affect the MFCCs in a negative way, right?

Yes, exactly. I've tried this and the results improve even when using a bad speaker encoder. Training with a better speaker encoder should improve things even more, especially for speakers not seen during training.

Resampling is really slow.

@erogol In some tests I did, when I fed a 16 kHz audio (upsampled to 22 kHz) to a speaker encoder trained at 22 kHz, the performance dropped a lot. However, I didn't try it without the upsampling.

@mueller91 @erogol Do you think it is feasible and makes sense to train with audio at 22 kHz and 16 kHz at the same time?

@mueller91
Contributor

Here is the current model: trained to 180k steps on LibriTTS Clean 100, 360 and 500, VCTK, VoxCeleb1+2 and Mozilla Common Voice, a total of >25k speakers, with 64 speakers per batch. Loss is at 0.25.

[training loss plot]

You can download the model and config at:
https://drive.google.com/file/d/1C8cXVEhra5WqEFArwTj-xFIgBn1GObxX/view?usp=sharing
https://drive.google.com/file/d/1q-igIrHvtqoKj6rRNljE7ChNta8hJInA/view?usp=sharing

@Edresson I can't train at 22 kHz and 16 kHz at the same time, because I only have access to a single GPU and the current model (with 768 hidden units and 64 speakers per batch) does not fit on my GPU twice.
Do you think Tacotron + vocoder could work at 16 kHz?

@Edresson
Contributor

@mueller91 It should work, but the quality may not be as good for real applications. If it is just for data generation, I believe it is good enough. Perhaps it would be interesting to test how the speaker encoder behaves when receiving 22 kHz audio instead of 16 kHz (my test was the opposite: a speaker encoder trained at 22 kHz received a 16 kHz sample that was upsampled).

If the performance loss is not large, we can use the 16 kHz speaker encoder to calculate the distance between speakers during training (the extra speaker-encoder loss) for a model trained at 22 kHz :)

@erogol
Contributor Author

erogol commented Sep 22, 2020

@mueller91 it is a great contribution. Thanks!

I see that it is still converging. I guess you need the GPU, since you stopped training.

@erogol
Contributor Author

erogol commented Sep 22, 2020

@Edresson I still don't think we need a different sampling rate for the encoder. You can always resample the audio before computing the embedding vector.

@mueller91
Contributor

@erogol I'll keep training, this was only a snapshot.

@Edresson I have not forgotten your request. However, I only have a single GPU available, and I would like to train the current model a bit more before I start with your config. Upsampling to 22 kHz introduces significant overhead during data loading; would 16 kHz and 80 mel channels be helpful to you? This paper reports SOTA at 16 kHz.

@Edresson
Contributor

Edresson commented Sep 22, 2020

@Edresson i still dont think we need a different sampling rate for the encoder. you can always resample an audio before computing the embedding vector.

@erogol The idea is to use the speaker encoder to calculate a loss during TTS training. And I don't know how to resample a spectrogram, so the ideal would be to have a speaker encoder trained at 22 kHz.

@mueller91 can focus on the 16 kHz speaker encoder :). As I said above, there may not be a big difference in performance, and we could use it on 22 kHz audio. A while ago I trained a 22 kHz model, compatible with the TTS audio configuration, on LibriTTS 360 and 100 clean; that model is not as good as yours, but it works :).

@erogol
Contributor Author

erogol commented Sep 22, 2020

@Edresson you don't need to resample the spec. You resample the audio and then compute the spec. Basically, use separate audio processors for the speaker encoder and the rest.
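Something like this, as a sketch only (sample rate and mel settings are illustrative, not our actual configs):

import librosa

def encoder_features(wav_path, encoder_sr=16000, n_mels=80):
    # resample only for the speaker encoder; the TTS pipeline keeps its own audio processor
    wav, _ = librosa.load(wav_path, sr=encoder_sr)
    mel = librosa.feature.melspectrogram(y=wav, sr=encoder_sr, n_mels=n_mels)
    return librosa.power_to_db(mel)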

@mueller91
Contributor

mueller91 commented Sep 23, 2020

I have further optimized the DataLoader and now incur zero overhead when loading the data from disk (see LoaderTime); I train 1000 steps in about 15 minutes (1.25 steps per second).

| > Step:1140  Loss:0.93590  AvgLoss:1.20784  GradNorm:58.14209  StepTime:0.77  LoaderTime:0.00  AvGLoaderTime:0.01  LR:0.000100
| > Step:1160  Loss:1.16374  AvgLoss:1.20117  GradNorm:54.96458  StepTime:0.77  LoaderTime:0.00  AvGLoaderTime:0.01  LR:0.000100
| > Step:1180  Loss:0.92916  AvgLoss:1.19200  GradNorm:52.99776  StepTime:0.78  LoaderTime:0.01  AvGLoaderTime:0.01  LR:0.000100

@Edresson I have started training the 80-mel, 16 kHz speaker encoder; I'll keep you updated. Is the speaker-encoder-based similarity loss already implemented?

@Edresson
Contributor

@mueller91 Yes, on one of my branches. We intend to merge it into TTS in the future :).

Are you training with this audio config here, except for the sample rate, correct?

For the sample rate, @erogol had the idea of using interpolation, as discussed in issue #520; we can try this :).

@mueller91
Contributor

mueller91 commented Sep 24, 2020

@Edresson yes, I used your audio config, except for the sample rate and do_trim_silence, which I set to true.

Edit: I noticed that changing the win_length from 400 to 1024 results in fewer frames for 1.6 s of audio. Do you think it makes sense to increase the length of the audio to maybe 2 or 3 s during training? As far as I remember, the original paper reported improvements for longer audio (up to 5 s) during inference.
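A quick back-of-the-envelope check (the hop lengths are my assumption for the two window sizes, and centering/padding is ignored):

def n_frames(n_samples, win_length, hop_length):
    return 1 + (n_samples - win_length) // hop_length

samples = int(1.6 * 16000)             # 1.6 s at 16 kHz = 25600 samples
print(n_frames(samples, 400, 160))     # -> 158 frames
print(n_frames(samples, 1024, 256))    # -> 97 frames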

@erogol
Contributor Author

erogol commented Nov 10, 2020

Hi @erogol, is the new Multi-Speaker Tacotron2 DDC you released in the wiki using this new encoder? I haven't seen a mention of the encoder used in the colab VCTK-Tacotron2_DDC-WaveGrad.ipynb. Thanks!

I should also mention that. Thanks for reminding me. Yes, I use the latest encoder.

@WeberJulian
Contributor

Great! Huge thanks to you @erogol, and to you as well @mueller91, for the impressive work.

@george-roussos
Contributor

Hi, does compute_embeddings.py not work with the model trained here? I tried to grab the model and plug it into compute_embeddings.py, but I get

RuntimeError: Error(s) in loading state_dict for SpeakerEncoder:
	Missing key(s) in state_dict: "layers.0.lstm.weight_ih_l0", "layers.0.lstm.weight_hh_l0", "layers.0.lstm.bias_ih_l0", 
"layers.0.lstm.bias_hh_l0", "layers.0.linear.weight", "layers.1.lstm.weight_ih_l0", "layers.1.lstm.weight_hh_l0", "layers.1.lstm.bias_ih_l0", "layers.1.lstm.bias_hh_l0", "layers.1.linear.weight", "layers.2.lstm.weight_ih_l0", "layers.2.lstm.weight_hh_l0", "layers.2.lstm.bias_ih_l0", "layers.2.lstm.bias_hh_l0", "layers.2.linear.weight". 
	Unexpected key(s) in state_dict: "model", "optimizer", "step", "loss", "date". 

@lexkoro
Contributor

lexkoro commented Nov 11, 2020

It works, I just used it. Are you sure you are using the right model?

@george-roussos
Contributor

Yeah! I am using master and the models from the Drive link (I tried all the models on the link), and compute_embeddings.py from a slightly older commit, since it is not there now. I also tried model.py from dev but got the same error.

@lexkoro
Contributor

lexkoro commented Nov 11, 2020

Ah, I used compute_embeddings.py from dev, which worked for me.

@george-roussos
Contributor

Which commits are you using? The compute_embeddings.py is not there anymore

@lexkoro
Contributor

lexkoro commented Nov 11, 2020

Current dev: https://github.com/mozilla/TTS/blob/dev/TTS/bin/compute_embeddings.py
I guess the file was moved.

@george-roussos
Contributor

george-roussos commented Nov 11, 2020

Oh, that's where it was 🤦🏻‍♂️ thanks mate. Very strange, it is still not working even though I pulled the latest dev. It crashes at the model.load_state_dict line, and the only thing I changed was mapping the storage to CPU, because I am trying to load it on my laptop.

I just added strict=False and it seems to do the trick. Weird. Thanks a lot for trying to help. 🤗
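Edit: looking at the unexpected keys again ("model", "optimizer", "step", ...), the file seems to be a full training checkpoint rather than a bare state dict, so unwrapping checkpoint["model"] might be a cleaner fix than strict=False. A sketch (the module path and constructor values are my guesses; the real ones come from the config.json shipped with the checkpoint):

import torch
from TTS.speaker_encoder.model import SpeakerEncoder   # path guessed from the dev branch layout

# placeholder values; use the ones from the shipped config.json
model = SpeakerEncoder(input_dim=40, proj_dim=256, lstm_dim=768, num_lstm_layers=3)
checkpoint = torch.load("best_model.pth.tar", map_location=torch.device("cpu"))
model.load_state_dict(checkpoint["model"])              # the weights live under the "model" key
model.eval()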

@WeberJulian
Contributor

Also @erogol, could you please tell us which of the checkpoints on the Google Drive you used to train the multi-speaker model?

Is it the last one you added (320k), the one with the most steps (330k), or the best_model, which is a month older than the 320k one? Thanks.

@erogol
Contributor Author

erogol commented Nov 13, 2020

I guess it is 320k. @Edresson computed the embeddings.

@lexkoro
Contributor

lexkoro commented Nov 13, 2020

@WeberJulian The best_model was trained to ~370k steps. So I would assume it should be better?

@WeberJulian
Contributor

@sanjaesc Yeah, it's probably better, but I'm fine-tuning the VCTK multi-speaker model in my language, so I need the exact checkpoint that was used to compute the embeddings, even if it is worse; otherwise my model won't work properly (I think).

@lexkoro
Contributor

lexkoro commented Nov 13, 2020

Shouldn't a better speaker-encoder compute more accurate embeddings for your dataset and thus result in a more robust model?

@WeberJulian
Contributor

I don't know, since the embeddings wouldn't mean the same thing anymore. I don't have enough speakers in my dataset to make the model learn (slightly?) different embeddings; I need a model that already knows how to interpret the embeddings. At least that's my intuition. But if you think the newer checkpoint might work better, I may try it after this training ends. Thanks for the advice.

@oytunturk

Hi,

Thanks for the great effort! I'm experimenting with the various multi-speaker TTS recipes shared in this project. Has anyone tried training a Tacotron model with LibriSpeech/LibriTTS data, or any other large-scale US English dataset? I'm able to get decent results with the VCTK-based Tacotron model, but it's limited to UK English and the speaker variety is not sufficient for my application. I'm aware that we can create random speaker embeddings, or even random style tokens if it's a GST-based model, but I still think that when Tacotron sees only a limited number of speakers, as in VCTK, everything you can generate is limited to that speaker set in terms of speaker variation. If a larger-scale Tacotron model hasn't been trained yet, I might be able to put some effort into it and share a pre-trained model if it goes well. Any thoughts?

@WeberJulian
Contributor

Hi, I think the model based on VCTK is the latest and greatest on this repo but it shouldn't be too hard to fine-tune on a larger dataset.

@oytunturk

Hi, I think the model based on VCTK is the latest and greatest on this repo but it shouldn't be too hard to fine-tune on a larger dataset.

The VCTK Tacotron model is based on the UK English phoneme set. I don't know exactly what espeak does when you switch dialects, but I'm guessing the phoneme sets will be different, so training from scratch would be inevitable. Otherwise, the Tacotron output will be based on UK English espeak pronunciations, which may not be as accurate as using US English if, say, you are using LibriTTS for Tacotron training.

@WeberJulian
Contributor

I think en-uk has more phonemes in common with en-us than different ones. I just tried transfer learning from this model to French and it works reasonably well, so you shouldn't have any trouble with your use case. Try the faster route first, and if it doesn't suit you, you can always take the longer path.

@oytunturk

I think en-uk has more phonemes in common with en-us than different ones. I just tried transfer learning from this model to French and it works reasonably well, so you shouldn't have any trouble with your use case. Try the faster route first, and if it doesn't suit you, you can always take the longer path.

Yes, that makes sense. My naive guess is that it will perform better than using characters as input, but maybe a bit worse than the 'correct' phoneme set. The definition of a 'correct' phoneme set is also a bit fuzzy; it all depends on how well it represents the pronunciations of the speakers in your training database, which may contain accented speech, etc., that you might be unaware of.

@george-roussos
Contributor

george-roussos commented Dec 5, 2020

Hi, a couple of questions, especially for @mueller91. I am trying to recreate the experiment with the same config, the same datasets and a handful of private speakers (not more than 120, so definitely not a lot). However, I am having issues getting training started. It seems to freeze 15 minutes in: the RAM usage starts going up slowly (CPU allocation looks healthy), then fills up, and the whole thing freezes. I have tried with both 4 and 8 workers and it did not help. I have a machine with 8 vCPUs, 32 GB RAM and a V100.

Thanks! And thanks for the model. 😀😅

@mueller91
Contributor

Are you using my code, where part of the samples is kept in memory to reduce I/O?
If yes, then it sounds like you're using up all your RAM caching the audio files. Have you tried decreasing the storage_size in the config?

@george-roussos
Contributor

I am using the dev branch, so I guess it has this, yes; I also tried your fork and got the same problem. I tried decreasing the storage_size to 15, but it didn't really help, and if it is lower, the I/O increases a lot. How much RAM is needed to cache all the wavs without problems?

@mueller91
Contributor

Try decreasing it to zero and see if the RAM problem persists.
If you set storage_size to 15, it keeps 15 * num_loaders * batch_size utterances in memory (I think), which is quite a lot.

And yeah, the I/O really is a problem; you really need SSDs for it.
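As a rough order-of-magnitude estimate (the average utterance length, sample rate and the assumption that raw float32 waveforms are cached are mine, just for illustration; Python object overhead and the MFCC cache come on top):

storage_size = 15
num_loaders = 8
batch_size = 64
avg_seconds = 8.0                      # assumed average utterance length
sample_rate = 16000
bytes_per_sample = 4                   # float32

cached_utterances = storage_size * num_loaders * batch_size
gigabytes = cached_utterances * avg_seconds * sample_rate * bytes_per_sample / 1024 ** 3
print(cached_utterances, round(gigabytes, 1))   # 7680 utterances, roughly 3.7 GB with these numbers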

@george-roussos
Contributor

Actually, the SSD is not the problem, because I never run on an HDD. So the problems I am getting all occur on an SSD. How much RAM did you use? Setting storage_size to 1 (0 is not accepted) works, but then the loss jumps to 0.60, even though I use the same training sets as you. Did you only use the caching because you have an HDD?

@mueller91
Contributor

I had to use caching because I use an HDD.
It is expected that the loss is larger for a smaller storage size, since we re-use fewer training examples: every re-used sample has been seen, and therefore trained on, before, hence the lower loss.
I have 120 GB of RAM.

@george-roussos
Contributor

120 🤯 No wonder my small 30 GB won't work! Thanks a lot for the clarification and for confirming that it affects the loss.

@erogol
Contributor Author

erogol commented Dec 8, 2020

You can use swap space as an easy workaround.

If you create it on an SSD, it should be fast enough.

@george-roussos
Contributor

george-roussos commented Dec 21, 2020

Hi,

I was wondering if anybody has tried clustering the embeddings in order to get a better understanding of what the network learns. I extracted embeddings for my speakers and tried clustering them with HDBSCAN, but it only gives one (zero) label plus -1, which is apparently noise. This is what I have tried:

import numpy as np
import glob
import pandas as pd
import hdbscan
from joblib import Memory

# load one embedding vector per utterance
embeddings = list()
for file in glob.glob("embeddings_o/*/*.npy"):
    embeddings.append(np.load(file))

# stack into a (num_utterances, embedding_dim) table
dataframe = pd.DataFrame.from_records(np.vstack(embeddings))

# first attempt with mostly default parameters and min_cluster_size=15
clusterer = hdbscan.HDBSCAN(algorithm='best', alpha=1.0, approx_min_span_tree=True,
    gen_min_span_tree=False, leaf_size=40, memory=Memory(cachedir=None),
    metric='euclidean', min_cluster_size=15, min_samples=None, p=None).fit(dataframe)

# second fit with a smaller min_cluster_size; this overrides the clusterer above
clusterer = hdbscan.HDBSCAN(min_cluster_size=5).fit(dataframe)

print(clusterer.labels_)

and I get

[-1  0 -1 -1 -1 -1  1 -1 -1  0  0 -1  0 -1 -1  0 -1 -1 -1  0 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1  1 -1
  0 -1  1  0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  0 -1 -1 -1 -1 -1 -1 -1  0
  0  0 -1 -1 -1 -1 -1  0 -1 -1 -1 -1 -1 -1 -1  0 -1  0 -1 -1 -1 -1 -1  0
  0 -1 -1 -1  1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1  1 -1  0 -1 -1 -1  0 -1  0
 -1  0  0  0 -1 -1 -1  0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]

I set min_cluster_size to 5 because anything higher only gives back the noise label. Maybe it really does have only one label (and it is the pitch), but isn't it a bit weird that it doesn't learn anything else?
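Edit: since the encoder is trained with a cosine-based loss, maybe the embeddings should be L2-normalized before clustering (or a cosine-style metric used); something like this, just a guess:

import numpy as np
import hdbscan

X = np.vstack(embeddings)                            # reuses the list loaded above
X = X / np.linalg.norm(X, axis=1, keepdims=True)     # unit length, so Euclidean ~ cosine distance
labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(X)
print(np.unique(labels, return_counts=True))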

@george-roussos
Contributor

@mueller91 Do you have a branch where the inter- and intra-losses are implemented? They appear in a screenshot you shared above, but they are not in dev or any other branch I tried, and I am not sure how to implement them.
