
Xvectors: DNN Embeddings for Speaker Recognition #1896

Merged: 29 commits into kaldi-asr:master on Oct 3, 2017

Conversation

@david-ryan-snyder (Contributor) commented Sep 20, 2017

Overview
This pull request adds xvectors for speaker recognition. The system consists of a feedforward DNN with a statistics pooling layer. The DNN is trained with multiclass cross entropy over the list of training speakers (we may add other training objectives in the future). After training, variable-length utterances are mapped to fixed-dimensional embeddings, or “xvectors”, which are then scored in a PLDA backend. This is based on http://www.danielpovey.com/files/2017_interspeech_embeddings.pdf, but includes recent enhancements not in that paper, such as data augmentation.
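
To make the variable-length-to-fixed-length mapping concrete, here is a toy NumPy sketch of the idea only. The single frame-level layer, layer sizes, and random weights below are illustrative assumptions, not the actual nnet3 network (which is defined by the xconfig in the recipe and has several frame-level layers before pooling).

  import numpy as np

  def relu(x):
      return np.maximum(x, 0.0)

  def toy_xvector(feats, W_frame, W_seg):
      """feats: (num_frames, feat_dim) features of one utterance (any length)."""
      h = relu(feats @ W_frame)                 # frame-level layer, applied per frame
      # Statistics pooling: mean and stddev over time give a fixed-size vector.
      pooled = np.concatenate([h.mean(axis=0), h.std(axis=0)])
      return relu(pooled @ W_seg)               # segment-level layer: the embedding

  rng = np.random.default_rng(0)
  feat_dim, hidden_dim, embed_dim = 23, 512, 512            # illustrative sizes only
  W_frame = 0.01 * rng.standard_normal((feat_dim, hidden_dim))
  W_seg = 0.01 * rng.standard_normal((2 * hidden_dim, embed_dim))
  for num_frames in (200, 1500):                # different utterance lengths...
      x = rng.standard_normal((num_frames, feat_dim))
      print(toy_xvector(x, W_frame, W_seg).shape)            # ...same (512,) embedding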

This PR also adds a new data augmentation script, which is important to achieve good performance in the xvector system. It is also helpful for ivectors (but only in PLDA training).

This PR adds a basic SRE16 recipe to demonstrate the system. An ivector system is in v1, and an xvector system is in v2.

Example Generation
An example consists of a chunk of speech features and the corresponding speaker label. Within an archive, all examples have the same chunk-size, but the chunk-size varies across archives. The relevant additions:

  • sid/nnet3/xvector/get_egs.sh — Top-level script for example creation
  • sid/nnet3/xvector/allocate_egs.py — Decides what goes into each example and which archive each example belongs to (a simplified sketch of this allocation appears after this list).
  • src/nnet3bin/nnet3-xvector-get-egs — The binary that creates the examples; it constructs them based on the ranges.* files.
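
For orientation, here is a much-simplified sketch of the allocation idea referenced above, assuming the ranges line format documented in get_egs.sh. The function and variable names are illustrative, not the actual interface of allocate_egs.py, which also handles frame budgets, diagnostic subsets, and other details.

  import random

  def toy_allocate(utt2len, utt2spk, num_archives, egs_per_archive,
                   min_chunk=200, max_chunk=400, seed=0):
      """Return (utt, relative-archive, absolute-archive, start-frame,
      num-frames, speaker-label) tuples in the spirit of the ranges.* files."""
      rng = random.Random(seed)
      spk2label = {s: i for i, s in enumerate(sorted(set(utt2spk.values())))}
      ranges = []
      for archive in range(num_archives):
          # One chunk length per archive: all egs in an archive share it.
          chunk_len = rng.randint(min_chunk, max_chunk)
          eligible = [u for u, n in utt2len.items() if n >= chunk_len]
          if not eligible:
              continue
          for _ in range(egs_per_archive):
              utt = rng.choice(eligible)
              start = rng.randint(0, utt2len[utt] - chunk_len)
              ranges.append((utt, 0, archive, start, chunk_len,
                             spk2label[utt2spk[utt]]))
      return ranges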

Training
This version of xvectors is trained with multiclass cross entropy (softmax over the training speakers). Fortunately, steps/nnet3/train_raw_dnn.py is compatible with the egs created here, so no new code is needed for training. Relevant code:

  • sre16/v1/local/nnet3/xvector/tuning/run_xvector_1a.sh — Does example creation, creates the xconfig, and trains the nnet
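
For reference, the multiclass cross entropy mentioned above is ordinary softmax cross entropy with one class per training speaker. A minimal NumPy sketch of the per-example loss (the real training loop lives in steps/nnet3/train_raw_dnn.py and the nnet3 library):

  import numpy as np

  def speaker_xent_loss(logits, spk_label):
      """logits: (num_training_speakers,) pre-softmax DNN outputs for one chunk.
      spk_label: integer speaker index for that chunk."""
      logits = logits - logits.max()                     # numerical stability
      log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax
      return -log_probs[spk_label]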

Extracting XVectors
After training, xvectors are extracted from a specified layer of the DNN after the temporal pooling layer. Relevant additions:

  • sid/nnet3/xvector/extract_xvectors.sh — Extracts embeddings from the xvector DNN. This is analogous to extract_ivectors.sh.
  • src/nnet3bin/nnet3-xvector-compute — Does the forward computation for the xvector DNN (variable-length input, with a single output).
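
Purely as a hypothetical illustration of how several fixed-size forward passes over a long utterance could be combined into a single embedding, the sketch below takes a frame-count-weighted average of per-chunk embeddings. Nothing here is the actual interface or guaranteed behavior of nnet3-xvector-compute; extract_chunk_xvector stands in for one forward pass through the trained DNN.

  def utterance_xvector(feats, extract_chunk_xvector, max_chunk=10000, min_chunk=25):
      """Frame-count-weighted average of per-chunk embeddings (illustrative only)."""
      xvec_sum, total_frames = None, 0
      for start in range(0, len(feats), max_chunk):
          chunk = feats[start:start + max_chunk]
          if len(chunk) < min_chunk:         # too short to pool over; skip it
              continue
          v = extract_chunk_xvector(chunk) * len(chunk)
          xvec_sum = v if xvec_sum is None else xvec_sum + v
          total_frames += len(chunk)
      return xvec_sum / total_frames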

Augmentation
We’ve found that embeddings almost always benefit from augmented training data. This appears to be true even when evaluated on clean telephone speech. Relevant additions:

  • steps/data/augment_data_dir.py — Similar to reverberate_data_dir.py, but only handles additive noise (see the sketch after this list).
  • egs/sre16/v1/run.sh — The PLDA training list is augmented with reverb and MUSAN audio.
  • egs/sre16/v2/run.sh — The DNN training and PLDA training lists are augmented with reverb and MUSAN audio.
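
The core operation behind the additive-noise augmentation is mixing a noise signal into the speech at a chosen SNR. A minimal NumPy sketch of that idea follows; the actual augment_data_dir.py operates on Kaldi data directories and wav pipes rather than raw sample arrays.

  import numpy as np

  def mix_at_snr(speech, noise, snr_db):
      """Add noise to speech so the speech-to-noise power ratio equals snr_db."""
      reps = int(np.ceil(len(speech) / len(noise)))
      noise = np.tile(noise, reps)[:len(speech)]         # loop/trim noise to length
      speech_power = np.mean(speech ** 2)
      noise_power = np.mean(noise ** 2) + 1e-10
      # Scale noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
      scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
      return speech + scale * noise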

SRE16 Recipe
The PR includes a bare-bones SRE16 recipe. The goal is primarily to demonstrate how to train and evaluate an xvector system. The version in egs/sre16/v1/ is a straightforward i-vector system; the recipe in egs/sre16/v2 is the DNN embedding recipe. Relevant additions:

  • egs/sre16/v1/local/ — Data preparation scripts
  • egs/sre16/v2/local/nnet3/xvector/prepare_feats_for_egs.sh — Applies CMVN, removes silence frames, and writes the results to disk; the nnet examples are generated from these features (sketched after this list).
  • egs/sre16/v1/run.sh — ivector top-level script
  • egs/sre16/v2/run.sh — xvector top-level script
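
As a rough conceptual sketch of the feature preparation step, the functions below apply a sliding-window mean normalization and then drop frames a VAD marked as non-speech. The function names and window size are illustrative assumptions, not the script's actual options.

  import numpy as np

  def sliding_cmn(feats, window=300):
      """Subtract a per-frame mean computed over a sliding window of frames."""
      out = np.empty_like(feats)
      half = window // 2
      for t in range(len(feats)):
          lo, hi = max(0, t - half), min(len(feats), t + half + 1)
          out[t] = feats[t] - feats[lo:hi].mean(axis=0)
      return out

  def remove_silence(feats, vad_mask):
      """Keep only the frames the VAD marked as speech (vad_mask is 0/1 per frame)."""
      return feats[np.asarray(vad_mask, dtype=bool)]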

Results for these recipes:

  xvector (from v2) EER: Pooled 8.76%, Tagalog 12.73%, Cantonese 4.86%
  ivector (from v1) EER: Pooled 12.98%, Tagalog 17.8%, Cantonese 8.35%

Note that the recipe is somewhat bare-bones. We could further improve the results of the xvector system by adding even more training data (e.g., Voxceleb: http://www.robots.ox.ac.uk/~vgg/data/voxceleb/). Both systems would benefit from updates to the backend, such as adaptive score normalization or more effective PLDA domain adaptation techniques. However, I believe that is orthogonal to this PR.

David Snyder and others added 23 commits September 14, 2017 16:30
…ng a recipe in egs/sre16. Adding script for data augmentation
… in steps/data/augment_data_dir.py, adding script to extract xvectors
…an support the xvector config. Also changing sid/nnet3/xvector/get_egs.sh so that it uses utt2num_frames instead of utt2len
…r recipe. NOTE that this is still a work in progress
…to extract_xvectors. Cosmetic improvements to several xvector scripts, and nnet3-xvector-compute
…pt prepare_feats_for_egs.sh which prepares features for xvector training
@david-ryan-snyder david-ryan-snyder changed the title WIP Xvectors: DNN embeddings for Speaker Recognition Xvectors: DNN embeddings for Speaker Recognition Sep 24, 2017
@david-ryan-snyder david-ryan-snyder changed the title Xvectors: DNN embeddings for Speaker Recognition Xvectors: DNN Embeddings for Speaker Recognition Sep 24, 2017
@osadj (Contributor) commented Sep 25, 2017

@david-ryan-snyder, which i-vector system is this? On SRE'16 evaluation set, we are getting 9.63% EER (minDCF~0.63) with a basic single i-vector system. It is promising to see you are reporting improvement with "xvectors" over your own i-vector system, but you also need to compare the result with that of a state-of-the-art system.

@danpovey (Contributor) commented Sep 25, 2017 via email

@osadj (Contributor) commented Sep 25, 2017

Of course. And no data augmentation was used. We only used previous SRE data (SRE04-SRE10) with a GMM-based system. Please see NIST's presentation at the SRE'16 workshop for more details.

@danpovey (Contributor) commented Sep 25, 2017 via email

@danpovey (Contributor) commented Sep 29, 2017 via email

@osadj (Contributor) commented Sep 29, 2017 via email

@entn-at (Contributor) commented Sep 29, 2017

@osadj For features used in calculating (GMM/DNN senone) posteriors, or as features for calculating Baum-Welch (BW) stats? Or are you talking about DNN embedding-type systems such as the one in this PR?

@osadj (Contributor) commented Sep 29, 2017 via email

@david-ryan-snyder (Contributor, Author) commented:

@danpovey, FYI, I'm rerunning the v2 recipe with MFCCs (same dim as the filter banks). I think by the time we go through the PR, I'll be able to update the results (which I doubt will change much).

@danpovey (Contributor) left a comment:

Mostly very small comments.

# of frames in each archive will be about the --frames-per-iter.
#
# This program will also output to the temp directory a file called
# archive_chunk_length which tesll you the frame-length associated with

typo: tesll

#
# where each line is interpreted as follows:
# <source-utterance> <relative-archive-index> <absolute-archive-index> \
# <start-frame-index> <num-frames> <spkr-lable>

typo: lable

--utt2int-filename=$dir/temp/utt2int.valid --egs-dir=$dir || exit 1
fi

# You want to put an exit 1 command here and look at exp/$dir/temp/ranges.*

this comment maybe was temporary.

nnet3-xvector-get-egs --compress=$compress --num-pdfs=$num_pdfs $temp/train_subset_ranges.1 \
"$train_subset_feats" $train_subset_outputs || touch $dir/.error &
wait
valid_outputs=`awk '{for(i=1;i<=NF;i++)printf("ark:%s ",$i);}' $temp/valid_outputs.1`

I prefer $() to backticks, but maybe not necessary to change.

echo "$0: Shuffling order of archives on disk"
$cmd --max-jobs-run $nj JOB=1:$num_train_archives $dir/log/shuffle.JOB.log \
nnet3-shuffle-egs --srand=JOB ark:$dir/egs_temp.JOB.ark ark,scp:$dir/egs.JOB.ark,$dir/egs.JOB.scp || exit 1;
$cmd --max-jobs-run $nj JOB=1:$num_diagnostic_archives $dir/log/train_subset_shuffle.JOB.log \

these lines are on the long side.

@@ -0,0 +1,325 @@
#!/usr/bin/env python

If this script would be trivial to make a python3 script, and you can test it, it would be great to have this be python3.
I'm encouraging python3 for new scripts now; we have accepted the dependency.

# Apache 2.0.

# This script extracts embeddings (called "xvectors" here) from a set of
# utterances, given features and a trained DNN.

it would be great if you could summarize how it differs from the regular extract_ivectors.sh script

# Apache 2.0
#
# This script dumps training examples (egs) for multiclass xvector training.
# These egs consist of a data chunk and a label. Each archive of egs has,

explain that the label corresponds to the speaker, and whether it's one-based or zero-based.

@@ -0,0 +1,96 @@
#!/usr/bin/env python

if this script could trivially be upgraded to python3, that would be great;
I now prefer python3 for new scripts.

}

// Delete the dynamically allocated memory.
static void Cleanup(unordered_map<std::string,

Instead of having its own function I'd prefer to just put this in its own code block ({...}) with
a comment like
{ // Free memory
It feels like it should be part of the main function.
For the vector you can use DeletePointers() rather than writing a loop.

@danpovey commented Sep 30, 2017:

Actually you could modify the DeletePointers() template so that it works with any container, not just vectors. You could template it on the STL container, e.g.
template <typename C> ...
and internally do:

  // T is the pointer type that the container contains, e.g. T == int*.
  typedef typename C::value_type T;

XXX forget that. It's a map not a set.

@david-ryan-snyder (Contributor, Author) commented:

Thanks Dan! I'll look into addressing these.

@dgromero commented Oct 2, 2017

@danpovey, @david-ryan-snyder: the reason we are using Fbanks is to better support our current research on bandwidth extension / multi-bandwidth systems, as well as some frequency-axis invariance (using convolutions). In our experiments, the results were very similar with MFCCs and Fbanks, but Fbanks are more in line with our future plans.

@dgromero commented Oct 2, 2017

@david-ryan-snyder, @danpovey I also wanted to mention that you are being too humble. Xvectors do not shine only for short durations! The next round of ICASSP papers will show that. The main advantage of xvectors (or, in general, DNN embedding architectures) is that they are able to leverage large amounts of data much better than ivector architectures. Specifically, data augmentation helps in the embedding training (whereas in standard ivector systems it typically hurts, and it is the back-end that needs to handle the multi-condition / data-augmented training). The reason is that augmenting the data used to train the UBM and T matrices is not such a good idea, since that is an unsupervised task, whereas the xvectors are trained in a supervised way with multiclass cross-entropy to discriminate among speakers. That makes the xvectors more invariant to those distortions. Also, you can still use all the same back-end bag of tricks that we have accumulated over the years for ivector systems.

@danpovey (Contributor) commented Oct 2, 2017 via email

…n in sre16 recipe and xvector code. Upgrading new python scripts to python3
@david-ryan-snyder (Contributor, Author) commented:

Except for the fbanks vs. MFCC issue, I believe I've addressed all the comments. I'm still waiting for the experiment using full-dim MFCCs to finish, and then this should be done.

@entn-at (Contributor) commented Oct 3, 2017

I adapted the setup to SRE10, using the same corpora that are used in the sre10/v2 recipe (SRE04-08, SWB2 p2/3, SWB Cell p1/2, Fisher) and got an EER of 2% (no further tuning, just using the same data augmentation procedure, fbank features, LDA dimension, no PLDA model adaptation (trained on SRE04-08) etc.).
Would be interesting to run this on the PRISM set or similar to see how it compares under noise/reverberation (but I assume you've already done those kinds of experiments - I'm just posting to confirm that it runs fine ;-)).

@david-ryan-snyder (Contributor, Author) commented:

Here are the updated results using MFCCs instead of Fbanks. The results are no worse; maybe they're even slightly better. I'll go ahead and change it over in the recipe.

  # -- Pooled --
  #                             Fbanks     MFCC
  # EER:                        8.97       8.66
  # min_Cprimary:               0.61       0.61
  # act_Cprimary:               0.63       0.62
  #
  # -- Cantonese --
  # EER:                        4.71       4.69
  # min_Cprimary:               0.42       0.42
  # act_Cprimary:               0.43       0.43
  #
  # -- Tagalog --
  # EER:                       13.23      12.63
  # min_Cprimary:               0.77       0.76
  # act_Cprimary:               0.82       0.81

@osadj (Contributor) commented Oct 3, 2017 via email

@david-ryan-snyder (Contributor, Author) commented Oct 3, 2017

@entn-at,

Very cool, thanks for running this!

I've done a similar experiment and got 1.7% EER on SRE10 (pooled, gender independent). I used different datasets (I included Voxceleb), though.

@david-ryan-snyder (Contributor, Author) commented Oct 3, 2017

Quoting @entn-at: "SRE04-08, SWB2 p2/3, SWB Cell p1/2, Fisher"

So you used Fisher in the xvector recipe? I've been wondering if this will work, since there aren't a lot of recordings per speaker. Maybe we could try an experiment where Fisher is augmented more aggressively than the other datasets (e.g., 3x or 4x). Might help to offset the limited data.

@osadj (Contributor) commented Oct 3, 2017 via email

@entn-at (Contributor) commented Oct 3, 2017

Omid: I ran it on C5 extended (same as the sre10/{v1,v2} recipes), pooled male/female (though it would be easy to get those numbers).

@osadj (Contributor) commented Oct 3, 2017 via email

@entn-at (Contributor) commented Oct 3, 2017

David: Yes, that is indeed the issue and the reason why most Fisher speakers get discarded during example preparation (I end up with 7170 speakers in the combined training set, 5767 of which are from SWB/SRE). More augmentation, or treating Fisher differently when pruning the lists, may make a big difference. Adding VoxCeleb is a good idea; I've used it in a different context for training in-domain PLDA/calibration. Another reason for the difference in EER I got could be that I reduced the number of training epochs from 3 to 2, as I only used 3 training jobs (num-jobs-initial/final=3).

Omid: Note that this was merely a proof-of-concept ("making it run"), I'm sure you can get better performance by tuning hyperparameters and adding more datasets. Without any tuning whatsoever, I get EER=2%, minDCF10=0.435
(for comparison, on the same set (SRE10, C5 extended, gender independent, pooled female/male) for the v1 recipe I got EER=2.26%, minDCF10=0.488, and for v2: EER=1.02%, minDCF10=0.197)

Edit: I included the run.sh as a gist (https://gist.github.com/entn-at/486562edfa01f99a07a03d9c905a1a50). As you can see, it's very close to the original SRE16 script.

@david-ryan-snyder (Contributor, Author) commented:

@osadj: I don't have the 1.7% EER on SRE10 using xvectors documented anywhere yet. In the future we'll probably work on a recipe for this (unless someone else finds a good one first, e.g., @entn-at :-)).

I think we can still do a lot better on SRE10 by using more augmentation on Fisher, additional datasets, etc. Maybe including ASR BNFs will help as well (although it's aesthetically undesirable, it's a fairer comparison with sre10/v2).

…ted the results (results are no worse). This was done because mfccs are more compressible on disk.
@david-ryan-snyder (Contributor, Author) commented:

@danpovey, I've addressed those comments. I think the PR should be ready to ship now.

@danpovey merged commit e082c17 into kaldi-asr:master on Oct 3, 2017
@danpovey (Contributor) commented Oct 3, 2017

thanks! Merged.

@497354078 commented Jan 4, 2018

Hi, I have a question about the TDNN (sre16/v2/local/nnet3/xvector/run_xvector.sh, line 105):
'''The stats pooling layer. Layers after this are segment-level.
In the config below, the first and last argument (0, and ${max_chunk_size})
means that we pool over an input segment starting at frame 0
and ending at frame ${max_chunk_size} or earlier. The other arguments (1:1)
mean that no subsampling is performed.'''
stats-layer name=stats config=mean+stddev(0:1:1:${max_chunk_size})
This layer's output dim is 3001, but mean (1500) concatenated with stddev (1500) is 3000 dims. Can you explain this in detail? Thanks.

@david-ryan-snyder (Contributor, Author) commented Jan 4, 2018

@497354078, in nnet3, this pooling layer is implemented as two components: StatisticsExtractionComponent and StatisticsPoolingComponent.

The first component computes sum(X), sum(X^2), and a count. That's why its output has 3001 rather than 3000 dimensions (1500 + 1500 + 1). This is then received as input by the second component, which uses these sums and the count to compute avg(X) and stddev(X). The output of this second component is 3000-dimensional, as you would expect.

It might help to look at the forward computation for each component: https://github.com/kaldi-asr/kaldi/blob/master/src/nnet3/nnet-general-component.cc#L447 and https://github.com/kaldi-asr/kaldi/blob/master/src/nnet3/nnet-general-component.cc#L771
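
A small NumPy sketch of that two-stage computation, using the 1500-dimensional pooling input from the question above (the real components also handle things like subsampling and variance flooring, which are omitted here):

  import numpy as np

  T, D = 300, 1500                        # frames in the segment, input dim
  x = np.random.randn(T, D)

  # StatisticsExtractionComponent (simplified): a count, sum(x), and sum(x^2).
  stats = np.concatenate([[T], x.sum(axis=0), (x ** 2).sum(axis=0)])
  print(stats.shape)                      # (3001,) = 1 + 1500 + 1500

  # StatisticsPoolingComponent (simplified): mean and stddev from those stats.
  count, s, s2 = stats[0], stats[1:D + 1], stats[D + 1:]
  mean = s / count
  stddev = np.sqrt(np.maximum(s2 / count - mean ** 2, 0.0))
  print(np.concatenate([mean, stddev]).shape)   # (3000,) = 1500 + 1500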
