
Xvectors: DNN Embeddings for Speaker Recognition #1896

Merged: 29 commits into kaldi-asr:master on Oct 3, 2017

Conversation

@david-ryan-snyder (Contributor) commented Sep 20, 2017

Overview
This pull request adds xvectors for speaker recognition. The system consists of a feedforward DNN with a statistics pooling layer. The DNN is trained with multiclass cross entropy over the list of training speakers (we may add other training objectives in the future). After training, variable-length utterances are mapped to fixed-dimensional embeddings, or “xvectors”, which are then scored in a PLDA backend. This is based on http://www.danielpovey.com/files/2017_interspeech_embeddings.pdf, but includes recent enhancements not in that paper, such as data augmentation.
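
To make the variable-length-to-fixed-length mapping concrete, here is a toy NumPy sketch of the idea only. The single frame-level layer, layer sizes, and random weights below are illustrative assumptions, not the actual nnet3 network (which is defined by the xconfig in the recipe and has several frame-level layers before pooling).

  import numpy as np

  def relu(x):
      return np.maximum(x, 0.0)

  def toy_xvector(feats, W_frame, W_seg):
      """feats: (num_frames, feat_dim) features of one utterance (any length)."""
      h = relu(feats @ W_frame)                 # frame-level layer, applied per frame
      # Statistics pooling: mean and stddev over time give a fixed-size vector.
      pooled = np.concatenate([h.mean(axis=0), h.std(axis=0)])
      return relu(pooled @ W_seg)               # segment-level layer: the embedding

  rng = np.random.default_rng(0)
  feat_dim, hidden_dim, embed_dim = 23, 512, 512            # illustrative sizes only
  W_frame = 0.01 * rng.standard_normal((feat_dim, hidden_dim))
  W_seg = 0.01 * rng.standard_normal((2 * hidden_dim, embed_dim))
  for num_frames in (200, 1500):                # different utterance lengths...
      x = rng.standard_normal((num_frames, feat_dim))
      print(toy_xvector(x, W_frame, W_seg).shape)            # ...same (512,) embedding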

This PR also adds a new data augmentation script, which is important to achieve good performance in the xvector system. It is also helpful for ivectors (but only in PLDA training).

This PR adds a basic SRE16 recipe to demonstrate the system. An ivector system is in v1, and an xvector system is in v2.

Example Generation
An example consists of a chunk of speech features and the corresponding speaker label. Within an archive, all examples have the same chunk-size, but the chunk-size varies across archives. The relevant additions:

  • sid/nnet3/xvector/get_egs.sh — Top-level script for example creation
  • sid/nnet3/xvector/allocate_egs.py — Decides what goes into each example and which archive each example belongs to (a simplified sketch of this allocation appears after this list).
  • src/nnet3bin/nnet3-xvector-get-egs — The binary that creates the examples; it constructs them based on the ranges.* files.
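
For orientation, here is a much-simplified sketch of the allocation idea referenced above, assuming the ranges line format documented in get_egs.sh. The function and variable names are illustrative, not the actual interface of allocate_egs.py, which also handles frame budgets, diagnostic subsets, and other details.

  import random

  def toy_allocate(utt2len, utt2spk, num_archives, egs_per_archive,
                   min_chunk=200, max_chunk=400, seed=0):
      """Return (utt, relative-archive, absolute-archive, start-frame,
      num-frames, speaker-label) tuples in the spirit of the ranges.* files."""
      rng = random.Random(seed)
      spk2label = {s: i for i, s in enumerate(sorted(set(utt2spk.values())))}
      ranges = []
      for archive in range(num_archives):
          # One chunk length per archive: all egs in an archive share it.
          chunk_len = rng.randint(min_chunk, max_chunk)
          eligible = [u for u, n in utt2len.items() if n >= chunk_len]
          if not eligible:
              continue
          for _ in range(egs_per_archive):
              utt = rng.choice(eligible)
              start = rng.randint(0, utt2len[utt] - chunk_len)
              ranges.append((utt, 0, archive, start, chunk_len,
                             spk2label[utt2spk[utt]]))
      return ranges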

Training
This version of xvectors is trained with multiclass cross entropy (softmax over the training speakers). Fortunately, steps/nnet3/train_raw_dnn.py is compatible with the egs created here, so no new code is needed for training. Relevant code:

  • sre16/v1/local/nnet3/xvector/tuning/run_xvector_1a.sh — Does example creation, creates the xconfig, and trains the nnet
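
For reference, the multiclass cross entropy mentioned above is ordinary softmax cross entropy with one class per training speaker. A minimal NumPy sketch of the per-example loss (the real training loop lives in steps/nnet3/train_raw_dnn.py and the nnet3 library):

  import numpy as np

  def speaker_xent_loss(logits, spk_label):
      """logits: (num_training_speakers,) pre-softmax DNN outputs for one chunk.
      spk_label: integer speaker index for that chunk."""
      logits = logits - logits.max()                     # numerical stability
      log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax
      return -log_probs[spk_label]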

Extracting XVectors
After training, xvectors are extracted from a specified layer of the DNN after the temporal pooling layer. Relevant additions:

  • sid/nnet3/xvector/extract_xvectors.sh — Extracts embeddings from the xvector DNN. This is analogous to extract_ivectors.sh.
  • src/nnet3bin/nnet3-xvector-compute — Does the forward computation for the xvector DNN (variable-length input, with a single output).
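
Purely as a hypothetical illustration of how several fixed-size forward passes over a long utterance could be combined into a single embedding, the sketch below takes a frame-count-weighted average of per-chunk embeddings. Nothing here is the actual interface or guaranteed behavior of nnet3-xvector-compute; extract_chunk_xvector stands in for one forward pass through the trained DNN.

  def utterance_xvector(feats, extract_chunk_xvector, max_chunk=10000, min_chunk=25):
      """Frame-count-weighted average of per-chunk embeddings (illustrative only)."""
      xvec_sum, total_frames = None, 0
      for start in range(0, len(feats), max_chunk):
          chunk = feats[start:start + max_chunk]
          if len(chunk) < min_chunk:         # too short to pool over; skip it
              continue
          v = extract_chunk_xvector(chunk) * len(chunk)
          xvec_sum = v if xvec_sum is None else xvec_sum + v
          total_frames += len(chunk)
      return xvec_sum / total_frames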

Augmentation
We’ve found that embeddings almost always benefit from augmented training data. This appears to be true even when evaluated on clean telephone speech. Relevant additions:

  • steps/data/augment_data_dir.py — Similar to reverberate_data_dir.py, but only handles additive noise (see the sketch after this list).
  • egs/sre16/v1/run.sh — The PLDA training list is augmented with reverb and MUSAN audio.
  • egs/sre16/v2/run.sh — The DNN training and PLDA training lists are augmented with reverb and MUSAN audio.
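
The core operation behind the additive-noise augmentation is mixing a noise signal into the speech at a chosen SNR. A minimal NumPy sketch of that idea follows; the actual augment_data_dir.py operates on Kaldi data directories and wav pipes rather than raw sample arrays.

  import numpy as np

  def mix_at_snr(speech, noise, snr_db):
      """Add noise to speech so the speech-to-noise power ratio equals snr_db."""
      reps = int(np.ceil(len(speech) / len(noise)))
      noise = np.tile(noise, reps)[:len(speech)]         # loop/trim noise to length
      speech_power = np.mean(speech ** 2)
      noise_power = np.mean(noise ** 2) + 1e-10
      # Scale noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
      scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
      return speech + scale * noise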

SRE16 Recipe
The PR includes a bare-bones SRE16 recipe. The goal is primarily to demonstrate how to train and evaluate an xvector system. The version in egs/sre16/v1/ is a straightforward i-vector system; the recipe in egs/sre16/v2 is the DNN embedding recipe. Relevant additions:

  • egs/sre16/v1/local/ — Data preparation scripts
  • egs/sre16/v2/local/nnet3/xvector/prepare_feats_for_egs.sh — Applies CMVN, removes silence frames, and writes the results to disk; the nnet examples are generated from these features (sketched after this list).
  • egs/sre16/v1/run.sh — ivector top-level script
  • egs/sre16/v2/run.sh — xvector top-level script
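
As a rough conceptual sketch of the feature preparation step, the functions below apply a sliding-window mean normalization and then drop frames a VAD marked as non-speech. The function names and window size are illustrative assumptions, not the script's actual options.

  import numpy as np

  def sliding_cmn(feats, window=300):
      """Subtract a per-frame mean computed over a sliding window of frames."""
      out = np.empty_like(feats)
      half = window // 2
      for t in range(len(feats)):
          lo, hi = max(0, t - half), min(len(feats), t + half + 1)
          out[t] = feats[t] - feats[lo:hi].mean(axis=0)
      return out

  def remove_silence(feats, vad_mask):
      """Keep only the frames the VAD marked as speech (vad_mask is 0/1 per frame)."""
      return feats[np.asarray(vad_mask, dtype=bool)]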

Results for these recipes:

  xvector (from v2) EER: Pooled 8.76%, Tagalog 12.73%, Cantonese 4.86%
  ivector (from v1) EER: Pooled 12.98%, Tagalog 17.8%, Cantonese 8.35%

Note that the recipe is somewhat bare-bones. We could further improve the results of the xvector system by adding even more training data (e.g., Voxceleb: http://www.robots.ox.ac.uk/~vgg/data/voxceleb/). Both systems would benefit from updates to the backend, such as adaptive score normalization or more effective PLDA domain adaptation techniques. However, I believe that is orthogonal to this PR.

David Snyder and others added 23 commits September 14, 2017 16:30
…ng a recipe in egs/sre16. Adding script for data augmentation
… in steps/data/augment_data_dir.py, adding script to extract xvectors
…an support the xvector config. Also changing sid/nnet3/xvector/get_egs.sh so that it uses utt2num_frames instead of utt2len
…r recipe. NOTE that this is still a work in progress
…to extract_xvectors. Cosmetic improvements to several xvector scripts, and nnet3-xvector-compute
…pt prepare_feats_for_egs.sh which prepares features for xvector training
@david-ryan-snyder david-ryan-snyder changed the title WIP Xvectors: DNN embeddings for Speaker Recognition Xvectors: DNN embeddings for Speaker Recognition Sep 24, 2017
@david-ryan-snyder david-ryan-snyder changed the title Xvectors: DNN embeddings for Speaker Recognition Xvectors: DNN Embeddings for Speaker Recognition Sep 24, 2017
@osadj (Contributor) commented Sep 25, 2017

@david-ryan-snyder, which i-vector system is this? On SRE'16 evaluation set, we are getting 9.63% EER (minDCF~0.63) with a basic single i-vector system. It is promising to see you are reporting improvement with "xvectors" over your own i-vector system, but you also need to compare the result with that of a state-of-the-art system.

@danpovey (Contributor) commented Sep 25, 2017 via email

@osadj (Contributor) commented Sep 25, 2017

Of course. And no data augmentation was used. We only used previous SRE data (SRE04-SRE10) with a GMM-based system. Please see NIST's presentation at the SRE'16 workshop for more details.

@danpovey (Contributor) commented Sep 25, 2017 via email

@danpovey (Contributor) commented Sep 29, 2017 via email

@osadj (Contributor) commented Sep 29, 2017 via email

@entn-at (Contributor) commented Sep 29, 2017

@osadj For features used in calculating (GMM/DNN senone) posteriors, or as features for calculating Baum-Welch (BW) stats? Or are you talking about DNN embedding-type systems such as the one in this PR?

@osadj (Contributor) commented Sep 29, 2017 via email

@david-ryan-snyder (Contributor, Author) commented:

@danpovey, FYI, I'm rerunning the v2 recipe with MFCCs (same dim as the filter banks). I think by the time we go through the PR, I'll be able to update the results (which I doubt will change much).

@danpovey (Contributor) left a comment:

Mostly very small comments.

# of frames in each archive will be about the --frames-per-iter.
#
# This program will also output to the temp directory a file called
# archive_chunk_length which tesll you the frame-length associated with

typo: tesll

#
# where each line is interpreted as follows:
# <source-utterance> <relative-archive-index> <absolute-archive-index> \
# <start-frame-index> <num-frames> <spkr-lable>

typo: lable

--utt2int-filename=$dir/temp/utt2int.valid --egs-dir=$dir || exit 1
fi

# You want to put an exit 1 command here and look at exp/$dir/temp/ranges.*

this comment maybe was temporary.

nnet3-xvector-get-egs --compress=$compress --num-pdfs=$num_pdfs $temp/train_subset_ranges.1 \
"$train_subset_feats" $train_subset_outputs || touch $dir/.error &
wait
valid_outputs=`awk '{for(i=1;i<=NF;i++)printf("ark:%s ",$i);}' $temp/valid_outputs.1`

I prefer $() to backticks, but maybe not necessary to change.

echo "$0: Shuffling order of archives on disk"
$cmd --max-jobs-run $nj JOB=1:$num_train_archives $dir/log/shuffle.JOB.log \
nnet3-shuffle-egs --srand=JOB ark:$dir/egs_temp.JOB.ark ark,scp:$dir/egs.JOB.ark,$dir/egs.JOB.scp || exit 1;
$cmd --max-jobs-run $nj JOB=1:$num_diagnostic_archives $dir/log/train_subset_shuffle.JOB.log \

these lines are on the long side.

@@ -0,0 +1,325 @@
#!/usr/bin/env python

If this script would be trivial to make a python3 script, and you can test it, it would be great to have this be python3.
I'm encouraging python3 for new scripts now; we have accepted the dependency.

# Apache 2.0.

# This script extracts embeddings (called "xvectors" here) from a set of
# utterances, given features and a trained DNN.

it would be great if you could summarize how it differs from the regular extract_ivectors.sh script

# Apache 2.0
#
# This script dumps training examples (egs) for multiclass xvector training.
# These egs consist of a data chunk and a label. Each archive of egs has,

explain that the label corresponds to the speaker, and whether it's one-based or zero-based.

@@ -0,0 +1,96 @@
#!/usr/bin/env python

if this script could trivially be upgraded to python3, that would be great;
I now prefer python3 for new scripts.

}

// Delete the dynamically allocated memory.
static void Cleanup(unordered_map<std::string,

Instead of having its own function I'd prefer to just put this in its own code block ({...}) with
a comment like
{ // Free memory
It feels like it should be part of the main function.
For the vector you can use DeletePointers() rather than writing a loop.

@danpovey commented Sep 30, 2017:

Actually you could modify the DeletePointers() template so that it works with any container, not just vectors. You could template it on the STL container, e.g.
template <typename C> ...
and internally do:

  // T is the pointer type that the container contains, e.g. T == int*.
  typedef typename C::value_type T;

XXX forget that. It's a map not a set.

@david-ryan-snyder (Contributor, Author) commented:

Thanks Dan! I'll look into addressing these.

@dgromero commented Oct 2, 2017

@danpovey, @david-ryan-snyder: the reason we are using Fbanks is to better support our current research on bandwidth extension / multi-bandwidth systems, as well as some frequency-axis invariance (using convolutions). In our experiments, the results were very similar with MFCCs and Fbanks, but Fbanks are more in line with our future plans.

@dgromero commented Oct 2, 2017

@david-ryan-snyder, @danpovey I also wanted to mention that you are being too humble. Xvectors do not shine only for short durations! The next round of ICASSP papers will show that. The main advantage of xvectors (or, in general, DNN embedding architectures) is that they are able to leverage large amounts of data much better than ivector architectures. Specifically, data augmentation helps in the embedding training (whereas in standard ivector systems it typically hurts, and it is the back-end that needs to handle the multi-condition / data-augmented training). The reason is that augmenting the data used to train the UBM and T matrices is not such a good idea, since that is an unsupervised task, whereas the xvectors are trained in a supervised way with multiclass cross-entropy to discriminate among speakers. That makes the xvectors more invariant to those distortions. Also, you can still use all the same back-end bag of tricks that we have accumulated over the years for ivector systems.

@danpovey (Contributor) commented Oct 2, 2017 via email

…n in sre16 recipe and xvector code. Upgrading new python scripts to python3
@david-ryan-snyder (Contributor, Author) commented:

Except for the fbanks vs. MFCC issue, I believe I've addressed all the comments. I'm still waiting for the experiment using full-dim MFCCs to finish, and then this should be done.

@entn-at (Contributor) commented Oct 3, 2017

I adapted the setup to SRE10, using the same corpora that are used in the sre10/v2 recipe (SRE04-08, SWB2 p2/3, SWB Cell p1/2, Fisher) and got an EER of 2% (no further tuning, just using the same data augmentation procedure, fbank features, LDA dimension, no PLDA model adaptation (trained on SRE04-08) etc.).
Would be interesting to run this on the PRISM set or similar to see how it compares under noise/reverberation (but I assume you've already done those kinds of experiments - I'm just posting to confirm that it runs fine ;-)).

@david-ryan-snyder (Contributor, Author) commented:

Here are the updated results using MFCCs instead of Fbanks. The results are no worse; maybe they're even slightly better. I'll go ahead and change it over in the recipe.

  # -- Pooled --
  #                             Fbanks     MFCC
  # EER:                        8.97       8.66
  # min_Cprimary:               0.61       0.61
  # act_Cprimary:               0.63       0.62
  #
  # -- Cantonese --
  # EER:                        4.71       4.69
  # min_Cprimary:               0.42       0.42
  # act_Cprimary:               0.43       0.43
  #
  # -- Tagalog --
  # EER:                       13.23      12.63
  # min_Cprimary:               0.77       0.76
  # act_Cprimary:               0.82       0.81

@osadj (Contributor) commented Oct 3, 2017 via email

@david-ryan-snyder (Contributor, Author) commented Oct 3, 2017

@entn-at,

Very cool, thanks for running this!

I've done a similar experiment and got 1.7% EER on SRE10 (pooled, gender independent). I used different datasets (I included Voxceleb), though.

@david-ryan-snyder (Contributor, Author) commented Oct 3, 2017

Quoting @entn-at: "SRE04-08, SWB2 p2/3, SWB Cell p1/2, Fisher"

So you used Fisher in the xvector recipe? I've been wondering if this will work, since there aren't a lot of recordings per speaker. Maybe we could try an experiment where Fisher is augmented more aggressively than the other datasets (e.g., 3x or 4x). Might help to offset the limited data.

@osadj (Contributor) commented Oct 3, 2017 via email

@entn-at (Contributor) commented Oct 3, 2017

Omid: I ran it on C5 extended (same as the sre10/{v1,v2} recipes), pooled male/female (though it would be easy to get those numbers).

@osadj (Contributor) commented Oct 3, 2017 via email

@entn-at (Contributor) commented Oct 3, 2017

David: Yes, that is indeed the issue and the reason why most Fisher speakers get discarded during example preparation (I end up with 7170 speakers in the combined training set, 5767 of which are from SWB/SRE). More augmentation, or treating Fisher differently when pruning the lists, may make a big difference. Adding VoxCeleb is a good idea; I've used it in a different context for training in-domain PLDA/calibration. Another reason for the difference in EER I got could be that I reduced the number of training epochs from 3 to 2, as I only used 3 training jobs (num-jobs-initial/final=3).

Omid: Note that this was merely a proof-of-concept ("making it run"), I'm sure you can get better performance by tuning hyperparameters and adding more datasets. Without any tuning whatsoever, I get EER=2%, minDCF10=0.435
(for comparison, on the same set (SRE10, C5 extended, gender independent, pooled female/male) for the v1 recipe I got EER=2.26%, minDCF10=0.488, and for v2: EER=1.02%, minDCF10=0.197)

Edit: I included the run.sh as a gist (https://gist.github.com/entn-at/486562edfa01f99a07a03d9c905a1a50). As you can see, it's very close to the original SRE16 script.

@david-ryan-snyder (Contributor, Author) commented:

@osadj: I don't have the 1.7% EER on SRE10 using xvectors documented anywhere yet. In the future we'll probably work on a recipe for this (unless someone else finds a good one first, e.g., @entn-at :-)).

I think we can still do a lot better on SRE10 by using more augmentation on Fisher, additional datasets, etc. Maybe including ASR BNFs will help as well (although it's aesthetically undesirable, it's a fairer comparison with sre10/v2).

…ted the results (results are no worse). This was done because mfccs are more compressible on disk.
@david-ryan-snyder (Contributor, Author) commented:

@danpovey, I've addressed those comments. I think the PR should be ready to ship now.

@danpovey merged commit e082c17 into kaldi-asr:master on Oct 3, 2017
@danpovey (Contributor) commented Oct 3, 2017

thanks! Merged.

@497354078 commented Jan 4, 2018

Hi, I have a question about the TDNN (sre16/v2/local/nnet3/xvector/run_xvector.sh, line 105):
'''The stats pooling layer. Layers after this are segment-level.
In the config below, the first and last argument (0, and ${max_chunk_size})
means that we pool over an input segment starting at frame 0
and ending at frame ${max_chunk_size} or earlier. The other arguments (1:1)
mean that no subsampling is performed.'''
stats-layer name=stats config=mean+stddev(0:1:1:${max_chunk_size})
This layer's output dim is 3001, but mean (1500) concatenated with stddev (1500) is 3000 dims. Can you explain this in detail? Thanks.

@david-ryan-snyder (Contributor, Author) commented Jan 4, 2018

@497354078, in nnet3, this pooling layer is implemented as two components: StatisticsExtractionComponent and StatisticsPoolingComponent.

The first component computes sum(X), sum(X^2), and a count. That's why its output has 3001 rather than 3000 dimensions (1500 + 1500 + 1). This is then received as input by the second component, which uses these sums and the count to compute avg(X) and stddev(X). The output of this second component is 3000-dimensional, as you would expect.

It might help to look at the forward computation for each component: https://github.com/kaldi-asr/kaldi/blob/master/src/nnet3/nnet-general-component.cc#L447 and https://github.com/kaldi-asr/kaldi/blob/master/src/nnet3/nnet-general-component.cc#L771
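
A small NumPy sketch of that two-stage computation, using the 1500-dimensional pooling input from the question above (the real components also handle things like subsampling and variance flooring, which are omitted here):

  import numpy as np

  T, D = 300, 1500                        # frames in the segment, input dim
  x = np.random.randn(T, D)

  # StatisticsExtractionComponent (simplified): a count, sum(x), and sum(x^2).
  stats = np.concatenate([[T], x.sum(axis=0), (x ** 2).sum(axis=0)])
  print(stats.shape)                      # (3001,) = 1 + 1500 + 1500

  # StatisticsPoolingComponent (simplified): mean and stddev from those stats.
  count, s, s2 = stats[0], stats[1:D + 1], stats[D + 1:]
  mean = s / count
  stddev = np.sqrt(np.maximum(s2 / count - mean ** 2, 0.0))
  print(np.concatenate([mean, stddev]).shape)   # (3000,) = 1500 + 1500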
