Xvectors: DNN Embeddings for Speaker Recognition #1896
Conversation
…ng a recipe in egs/sre16. Adding script for data augmentation
… in steps/data/augment_data_dir.py, adding script to extract xvectors
…an support the xvector config. Also changing sid/nnet3/xvector/get_egs.sh so that it uses utt2num_frames instead of utt2len
…or training the xvector system
…r recipe. NOTE that this is still a work in progress
…n-snyder/kaldi into kaldi-xvector-sep-2017
…to extract_xvectors. Cosmetic improvements to several xvector scripts, and nnet3-xvector-compute
…pt prepare_feats_for_egs.sh which prepares features for xvector training
@david-ryan-snyder, which i-vector system is this? On the SRE'16 evaluation set, we are getting 9.63% EER (minDCF~0.63) with a basic single i-vector system. It is promising to see you are reporting improvement with "xvectors" over your own i-vector system, but you also need to compare the results with those of a state-of-the-art system.
Your 9.63% EER was pooled on Cantonese and Tagalog?
And was there any nonstandard training data in there?
Of course. And no data augmentation was used. We only used previous SRE data (SRE04-SRE10) with a GMM-based system. Please see NIST's presentation at the SRE'16 workshop for more details.
I think the point of the PR was to compare the embeddings; not a lot of attention was given to the backend. As pointed out in the PR, there is a lot of score normalization and domain adaptation of the PLDA which is possible but which was not done in this PR. The numbers on sre10 are basically state of the art, to my knowledge.
A paper on this is here
http://www.danielpovey.com/files/2017_interspeech_embeddings.pdf.
I think you should try the full-dimensional MFCCs-- I won't hold up the PR
over it, but it is the way we normally do such things.
Omid- I'm surprised that extra dimensions of MFCC did not help, esp. if
this was an nnet system.
…On Fri, Sep 29, 2017 at 2:49 PM, david-ryan-snyder ***@***.*** > wrote:
I actually haven't tested this with num-fbanks== mfcc-dim. My guess is we
can switch to using MFCCs in the embeddings recipe without loss of
performance.
I do see benefits with 20-D MFCCs over 13-D MFCCs (for speaker reco) using
GMMs, but not nnet. This observation has been consistent irrespective of
the ASR toolkit used to train the nnet (i.e., Attila or Kaldi).
@osadj For features used in calculating (GMM/DNN senone) posteriors or as features for calculating BW stats? Or are you talking about DNN embedding type systems such as this PR?
I am not talking about this PR (I have yet to try it; will do after
installing the new P100s we got). I meant DNN-based speaker reco systems in
general (either for computing the posteriors or BNFs). For BW stats, if I
use DNN posterior alignments, 20-D MFCC may or may not be better than 13-D
MFCC; it depends on the task. On SRE10 I just used 13-D MFCCs to compute
BW stats with DNN alignments (see the IBM 2016 speaker reco system).
@danpovey, FYI, I'm rerunning the v2 recipe with MFCCs (same dim as the filter banks). I think by the time we go through the PR, I'll be able to update the results (which I doubt will change much).
Mostly very small comments.
# of frames in each archive will be about the --frames-per-iter.
#
# This program will also output to the temp directory a file called
# archive_chunk_length which tesll you the frame-length associated with
typo: tesll
#
# where each line is interpreted as follows:
# <source-utterance> <relative-archive-index> <absolute-archive-index> \
# <start-frame-index> <num-frames> <spkr-lable>
typo: lable
  --utt2int-filename=$dir/temp/utt2int.valid --egs-dir=$dir || exit 1
fi

# You want to put an exit 1 command here and look at exp/$dir/temp/ranges.* |
maybe this comment was temporary.
nnet3-xvector-get-egs --compress=$compress --num-pdfs=$num_pdfs $temp/train_subset_ranges.1 \
  "$train_subset_feats" $train_subset_outputs || touch $dir/.error &
wait
valid_outputs=`awk '{for(i=1;i<=NF;i++)printf("ark:%s ",$i);}' $temp/valid_outputs.1` |
I prefer $() to backticks, but maybe not necessary to change.
echo "$0: Shuffling order of archives on disk"
$cmd --max-jobs-run $nj JOB=1:$num_train_archives $dir/log/shuffle.JOB.log \
  nnet3-shuffle-egs --srand=JOB ark:$dir/egs_temp.JOB.ark ark,scp:$dir/egs.JOB.ark,$dir/egs.JOB.scp || exit 1;
$cmd --max-jobs-run $nj JOB=1:$num_diagnostic_archives $dir/log/train_subset_shuffle.JOB.log \ |
these lines are on the long side.
@@ -0,0 +1,325 @@
#!/usr/bin/env python |
If it would be trivial to make this a python3 script, and you can test it, it would be great to have this be python3.
I'm encouraging python3 for new scripts now; we have accepted the dependency.
# Apache 2.0.

# This script extracts embeddings (called "xvectors" here) from a set of
# utterances, given features and a trained DNN. |
it would be great if you could summarize how it differs from the regular extract_ivectors.sh script
# Apache 2.0
#
# This script dumps training examples (egs) for multiclass xvector training.
# These egs consist of a data chunk and a label. Each archive of egs has,
explain that the label corresponds to the speaker, and whether it's one-based or zero-based.
egs/sre16/v1/local/make_musan.py
@@ -0,0 +1,96 @@
#!/usr/bin/env python
if this script could trivially be upgraded to python3, that would be great;
I now prefer python3 for new scripts.
}

// Delete the dynamically allocated memory.
static void Cleanup(unordered_map<std::string, |
Instead of having its own function I'd prefer to just put this in its own code block ({...}) with
a comment like
{ // Free memory
It feels like it should be part of the main function.
For the vector you can use DeletePointers() rather than writing a loop.
Actually you could modify the DeletePointers() template so that it works with any container, not just vectors. You could template it on the STL container, e.g.
template <typename C>
...
and internally do:
// T is the pointer type that the container contains, e.g. T == int*.
typedef typename C::value_type T;
XXX forget that. It's a map not a set.
Thanks Dan! I'll look into addressing these.
@danpovey, @david-ryan-snyder the reason why we are using Fbanks is to better support the current research we are doing in bandwidth extension/multi-bandwidth systems, as well as some frequency axis invariance (using convolutions). In the experiments, we saw that the results were very similar with MFCCs and Fbanks, but Fbanks are more in line with our future plans.
@david-ryan-snyder, @danpovey I also wanted to mention that you are being too humble. xvectors are not only shining for short durations! The next round of ICASSP papers will show that. The main advantage of xvectors (or, in general, DNN embedding architectures) is that they are able to leverage large amounts of data much better than the ivector architectures. Specifically, data augmentation helps in the embedding training part (whereas in standard ivector systems it typically hurts, and it is the back-end that needs to handle the multi-condition/data-augmented training). The reason is that augmenting the data used to train the UBM and T matrices is not such a good idea, since that is an unsupervised task. However, the xvectors are trained in a supervised way with multiclass cross-entropy to discriminate among speakers. That makes the xvectors more invariant to those distortions. Also, you can still use all the same back-end bag-of-tricks that we have accumulated over the years for ivector systems.
Regarding use of fbanks versus MFCCs: what I recommend is to dump the MFCCs
(for compression purposes), and then, if you need the original fbanks, you
can use the "idct-layer". Vimal added it when we were adding CNN-related
things.
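To illustrate why this works: MFCCs are (up to details such as liftering and dithering in Kaldi's actual frontend) a DCT of the log-mel filterbank energies, so if you keep as many cepstral coefficients as there are filterbanks, the transform is invertible and an IDCT layer can recover the fbanks. A toy numpy/scipy sketch of that roundtrip (a simplification, not Kaldi's feature pipeline):

```python
import numpy as np
from scipy.fftpack import dct, idct

# Toy stand-in for log-mel filterbank features: 10 frames, 40 mel bins.
log_fbank = np.random.rand(10, 40)

# "MFCCs" here are simply the orthonormal DCT-II of the log-mel energies.
# Keeping all 40 coefficients (num-ceps == num-mel-bins) loses nothing.
mfcc = dct(log_fbank, type=2, axis=1, norm='ortho')

# An IDCT layer recovers the original filterbanks, which is why one can
# store the more compressible MFCCs on disk and convert back to fbanks
# inside the network when needed.
recovered = idct(mfcc, type=2, axis=1, norm='ortho')
assert np.allclose(log_fbank, recovered)
```

If fewer cepstral coefficients than mel bins were kept, the reconstruction would instead be a smoothed approximation of the filterbanks.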
…n in sre16 recipe and xvector code. Upgrading new python scripts to python3
Except for the fbanks vs MFCC issue, I believe I've addressed all the comments. I'm still waiting for the experiment using full dim MFCCs to finish, and then this should be finished.
I adapted the setup to SRE10, using the same corpora that are used in the sre10/v2 recipe (SRE04-08, SWB2 p2/3, SWB Cell p1/2, Fisher) and got an EER of 2% (no further tuning, just using the same data augmentation procedure, fbank features, LDA dimension, no PLDA model adaptation (trained on SRE04-08), etc.). Would be interesting to run this on the PRISM set or similar to see how it compares under noise/reverberation (but I assume you've already done those kinds of experiments - I'm just posting to confirm that it runs fine ;-)).
Here are the updated results using MFCCs instead of Fbanks. The results are no worse. Maybe it's even slightly better. I'll go ahead and change it over in the recipe.
Thanks, Ewald, for sharing this. It is good to know that the setup can be
(easily?) adapted for other tasks as well. Can you also please provide
further details on the SRE10 task, e.g., test condition (C1..C5 or pooled),
male vs female (or pooled)? I understand the details are in the recipe ;-)
but it would be great if you could share those.
Very cool, thanks for running this! I've done a similar experiment and got 1.7% EER on SRE10 (pooled, gender independent). I used different datasets (I included Voxceleb), though.
So you used Fisher in the xvector recipe? I've been wondering if this will work, since there aren't a lot of recordings per speaker. Maybe we could try an experiment where Fisher is augmented more aggressively than the other datasets (e.g., 3x or 4x). Might help to offset the limited data.
David, I do not find any details on the SRE10 task in run.sh or the README.
Can you please add that information to either the README or run.sh where
you report the results? Thanks.
Omid: I ran it on C5 extended (same as the sre10/{v1,v2} recipes), pooled male/female (though it would be easy to get those numbers).
Thanks, Ewald. Is it then fair to compare the EER you reported to "# ind
pooled: 1.01" (that is a gender-independent system evaluated on pooled
male/female trials on C5)? hmm, EER~2% does not look right to me for a DNN
based system.
David: Yes, that is indeed the issue and the reason why most Fisher speakers get discarded during example preparation (I end up with 7170 speakers in the combined training set, 5767 of which are from SWB/SRE). More augmentation or treating Fisher differently when pruning the lists may make a big difference. Adding VoxCeleb is a good idea; I've used it in a different context for training in-domain PLDA/calibration. Another reason for the difference in EER I got could be that I reduced the number of training epochs from 3 to 2, as I only used 3 training jobs (num-jobs-initial/final=3).
Omid: Note that this was merely a proof-of-concept ("making it run"); I'm sure you can get better performance by tuning hyperparameters and adding more datasets. Without any tuning whatsoever, I get EER=2%, minDCF10=0.435.
Edit: I included the run.sh as a gist (https://gist.github.com/entn-at/486562edfa01f99a07a03d9c905a1a50). As you can see, it's very close to the original SRE16 script.
@osadj: I don't have the 1.7% EER on SRE10 using xvectors documented anywhere yet. In the future we'll probably work on a recipe for this (unless someone else finds a good one first, e.g., @entn-at :-)). I think we can still do a lot better on SRE10 by using more augmentation on Fisher, additional datasets, etc. Maybe including ASR BNFs will help as well (although it's aesthetically undesirable, it's a fairer comparison with sre10/v2).
…ted the results (results are no worse). This was done because mfccs are more compressible on disk.
@danpovey, I've addressed those comments. I think the PR should be ready to ship now.
thanks! Merged.
Hi, I have a question about TDNN,
@497354078, in nnet3, this pooling layer is implemented as two components: StatisticsExtractionComponent and StatisticsPoolingComponent. The first component computes sum(X), sum(X^2), and a count. That's why its output has 3001 rather than 3000 dimensions (1500 + 1500 + 1). This is then received as input by the second component, which uses these sums and the count to compute avg(X) and stddev(X). The output of this second component is 3000-dimensional, as you would expect. It might help to look at the forward computation for each component: https://github.com/kaldi-asr/kaldi/blob/master/src/nnet3/nnet-general-component.cc#L447 and https://github.com/kaldi-asr/kaldi/blob/master/src/nnet3/nnet-general-component.cc#L771
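For intuition, here is a rough numpy sketch of the two components just described (an illustration only, not the actual nnet3 implementation; dimensions are arbitrary):

```python
import numpy as np

def statistics_extraction(X):
    # Mirrors StatisticsExtractionComponent: given T frames of D-dim
    # input, output [count, sum(X), sum(X^2)], i.e. 2*D + 1 dims
    # (1 + 1500 + 1500 = 3001 in the question above).
    return np.concatenate([[float(X.shape[0])],
                           X.sum(axis=0),
                           (X ** 2).sum(axis=0)])

def statistics_pooling(stats):
    # Mirrors StatisticsPoolingComponent: turn the count and sums into
    # mean and standard deviation, giving back 2*D dims (3000 above).
    count = stats[0]
    d = (len(stats) - 1) // 2
    s, s2 = stats[1:1 + d], stats[1 + d:]
    mean = s / count
    var = s2 / count - mean ** 2
    return np.concatenate([mean, np.sqrt(np.maximum(var, 0.0))])
```

With D = 1500 this reproduces the 3001-in / 3000-out dimensions asked about.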
Overview
This pull request adds xvectors for speaker recognition. The system consists of a feedforward DNN with a statistics pooling layer. The training objective is multiclass cross entropy over the list of training speakers (we may add other objectives in the future). After training, variable-length utterances are mapped to fixed-dimensional embeddings or “xvectors” and used in a PLDA backend. This is based on http://www.danielpovey.com/files/2017_interspeech_embeddings.pdf, but includes recent enhancements not in that paper, such as data augmentation.
This PR also adds a new data augmentation script, which is important to achieve good performance in the xvector system. It is also helpful for ivectors (but only in PLDA training).
This PR adds a basic SRE16 recipe to demonstrate the system. An ivector system is in v1, and an xvector system is in v2.
Example Generation
An example consists of a chunk of speech features and the corresponding speaker label. Within an archive, all examples have the same chunk-size, but the chunk-size varies across archives. The relevant additions:
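As a simplified illustration of the idea (hypothetical code, not the actual get_egs.sh logic, which samples chunk positions via a ranges file and varies the chunk-size per archive):

```python
import numpy as np

def cut_into_examples(feats, spk_label, chunk_size):
    """Cut one utterance's feature matrix (num_frames x feat_dim) into
    non-overlapping fixed-length chunks, each paired with the integer
    speaker label. All examples destined for one archive would share the
    same chunk_size; different archives use different sizes."""
    examples = []
    for start in range(0, feats.shape[0] - chunk_size + 1, chunk_size):
        examples.append((feats[start:start + chunk_size], spk_label))
    return examples
```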
Training
This version of xvectors is trained with multiclass cross entropy (softmax over the training speakers). Fortunately, steps/nnet3/train_raw_dnn.py is compatible with the egs created here, so no new code is needed for training. Relevant code:
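The objective is the standard softmax cross-entropy over speaker labels; a minimal numpy sketch for a single training chunk (illustrative only, assuming a zero-based label; the real training is done by the nnet3 tools):

```python
import numpy as np

def speaker_xent_loss(logits, spk_id):
    """Multiclass cross-entropy for one chunk: logits is a
    (num_speakers,) vector from the network's final affine layer,
    spk_id the zero-based label of the true speaker."""
    logits = logits - logits.max()                   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[spk_id]
```

For example, with uniform logits over N speakers the loss is log(N), the entropy of a uniform guess.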
Extracting XVectors
After training, xvectors are extracted from a specified layer of the DNN after the temporal pooling layer. Relevant additions:
Augmentation
We’ve found that embeddings almost always benefit from augmented training data. This appears to be true even when evaluated on clean telephone speech. Relevant additions:
SRE16 Recipe
The PR includes a bare-bones SRE16 recipe. The goal is primarily to demonstrate how to train and evaluate an xvector system. The version in egs/sre16/v1/ is a straightforward i-vector system. The recipe in egs/sre16/v2 contains the DNN embedding recipe. Relevant additions:
Results for this example:
Note that the recipe is somewhat "bare bones." We could improve the results for the xvector system further by adding even more training data (e.g., Voxceleb: http://www.robots.ox.ac.uk/~vgg/data/voxceleb/). Both systems would improve from updates to the backend such as adaptive score normalization or more effective PLDA domain adaptation techniques. However, I believe that is orthogonal to this PR.