Xvectors: DNN Embeddings for Speaker Recognition #1896

Merged
merged 29 commits into kaldi-asr:master on Oct 3, 2017

Conversation

6 participants
@david-ryan-snyder (Contributor) commented Sep 20, 2017

Overview
This pull request adds xvectors for speaker recognition. The system consists of a feedforward DNN with a statistics pooling layer. Training uses multiclass cross entropy over the list of training speakers (we may add other training methods in the future). After training, variable-length utterances are mapped to fixed-dimensional embeddings or “xvectors”, which are used in a PLDA backend. This is based on http://www.danielpovey.com/files/2017_interspeech_embeddings.pdf, but includes recent enhancements not in that paper, such as data augmentation.

This PR also adds a new data augmentation script, which is important to achieve good performance in the xvector system. It is also helpful for ivectors (but only in PLDA training).

This PR adds a basic SRE16 recipe to demonstrate the system. An ivector system is in v1, and an xvector system is in v2.

Example Generation
An example consists of a chunk of speech features and the corresponding speaker label. Within an archive, all examples have the same chunk-size, but the chunk-size varies across archives. The relevant additions:

  • sid/nnet3/xvector/get_egs.sh — Top-level script for example creation
  • sid/nnet3/xvector/allocate_egs.py — Decides what each example contains and which archive it belongs to.
  • src/nnet3bin/nnet3-xvector-get-egs — The binary that creates the examples, based on the ranges.* file written by allocate_egs.py.
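The chunking scheme above can be sketched in a few lines of numpy. This is only illustrative: the helper name is mine, and the real binary follows the precomputed (start, chunk-size) pairs from the ranges.* file rather than drawing starts at random.

```python
import numpy as np

def make_chunk_example(feats, spk_label, chunk_size, rng):
    """Slice one fixed-length chunk from an utterance's feature matrix.

    feats: (num_frames, feat_dim) array; spk_label: integer speaker id.
    An example is just (feature chunk, speaker label); within an archive
    every example would share the same chunk_size.
    """
    num_frames = feats.shape[0]
    assert num_frames >= chunk_size, "utterance shorter than chunk size"
    start = rng.integers(0, num_frames - chunk_size + 1)
    return feats[start:start + chunk_size], spk_label

rng = np.random.default_rng(0)
utt = rng.standard_normal((500, 23))      # 500 frames of 23-dim features
chunk, label = make_chunk_example(utt, spk_label=7, chunk_size=200, rng=rng)
print(chunk.shape, label)                 # (200, 23) 7
```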

Training
This version of xvectors is trained with multiclass cross entropy (softmax over the training speakers). Fortunately, steps/nnet3/train_raw_dnn.py is compatible with the egs created here, so no new code is needed for training. Relevant code:

  • sre16/v1/local/nnet3/xvector/tuning/run_xvector_1a.sh — Does example creation, creates the xconfig, and trains the nnet
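For reference, the per-example objective is an ordinary softmax cross-entropy over the training speakers. A minimal numpy sketch (the function name is mine; the real loss is of course computed inside nnet3 during training):

```python
import numpy as np

def speaker_xent_loss(logits, spk_label):
    """Multiclass cross-entropy over training speakers for one chunk.

    logits: (num_speakers,) pre-softmax scores from the DNN output layer.
    Returns -log p(spk_label | chunk) under the softmax.
    """
    shifted = logits - logits.max()                    # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[spk_label]

logits = np.array([2.0, 0.5, -1.0])                    # toy 3-speaker output
loss = speaker_xent_loss(logits, spk_label=0)          # ≈ 0.241
```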

Extracting XVectors
After training, xvectors are extracted from a specified layer of the DNN, after the temporal pooling layer. Relevant additions:

  • sid/nnet3/xvector/extract_xvectors.sh — Extracts embeddings from the xvector DNN. This is analogous to extract_ivectors.sh.
  • src/nnet3bin/nnet3-xvector-compute — Does the forward computation for the xvector DNN (variable-length input, with a single output).
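The key property here, variable-length input mapped to a single fixed-dimensional output, comes from pooling statistics over time. A rough numpy sketch of just that step (the helper name is mine, and the real nnet3-xvector-compute also applies the learned layers before and after the pooling):

```python
import numpy as np

def pool_to_fixed(frame_activations):
    """Map variable-length frame-level activations to one fixed-size vector
    by concatenating the per-dimension mean and stddev over time."""
    mean = frame_activations.mean(axis=0)
    std = frame_activations.std(axis=0)
    return np.concatenate([mean, std])

rng = np.random.default_rng(0)
short_utt = pool_to_fixed(rng.standard_normal((120, 512)))   # 120 frames
long_utt = pool_to_fixed(rng.standard_normal((900, 512)))    # 900 frames
print(short_utt.shape, long_utt.shape)   # (1024,) (1024,): same dim either way
```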

Augmentation
We’ve found that embeddings almost always benefit from augmented training data. This appears to be true even when evaluated on clean telephone speech. Relevant additions:

  • steps/data/augment_data_dir.py — Similar to reverberate_data_dir.py but only handles additive noise.
  • egs/sre16/v1/run.sh — PLDA training list is augmented with reverb and MUSAN audio
  • egs/sre16/v2/run.sh — DNN training and PLDA list are augmented with reverb and MUSAN.
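The core idea of additive-noise augmentation can be sketched as follows. This is only the SNR-scaling idea in numpy with an illustrative helper name; the actual steps/data/augment_data_dir.py operates on Kaldi wav.scp pipelines and MUSAN/RIR lists, not raw arrays, and supports more options.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix additive noise into speech at a target SNR (in dB)."""
    if len(noise) < len(speech):                     # loop noise if too short
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    speech_pow = np.mean(speech ** 2)
    noise_pow = np.mean(noise ** 2)
    # Scale noise so that 10*log10(speech_pow / scaled_noise_pow) == snr_db.
    scale = np.sqrt(speech_pow / (noise_pow * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)                  # 1 s of fake 16 kHz audio
noise = rng.standard_normal(4000)
noisy = add_noise_at_snr(speech, noise, snr_db=10)
```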

SRE16 Recipe
The PR includes a bare-bones SRE16 recipe. The goal is primarily to demonstrate how to train and evaluate an xvector system. The version in egs/sre16/v1/ is a straightforward i-vector system; egs/sre16/v2 contains the DNN embedding recipe. Relevant additions:

  • egs/sre16/v1/local/ — A bunch of dataprep scripts
  • egs/sre16/v2/local/nnet3/xvector/prepare_feats_for_egs.sh — Applies CMVN, removes silence frames, and writes the results to disk. The nnet examples are generated from these features.
  • egs/sre16/v1/run.sh — ivector top-level script
  • egs/sre16/v2/run.sh — xvector top-level script

Results for this example:

  xvector (from v2) EER: Pooled 8.76%, Tagalog 12.73%, Cantonese 4.86%
  ivector (from v1) EER: Pooled 12.98%, Tagalog 17.8%, Cantonese 8.35%

Note that the recipe is somewhat "bare bones." We could improve the results for the xvector system further by adding even more training data (e.g., Voxceleb: http://www.robots.ox.ac.uk/~vgg/data/voxceleb/). Both systems would improve from updates to the backend such as adaptive score normalization or more effective PLDA domain adaptation techniques. However, I believe that is orthogonal to this PR.

David Snyder and others added some commits Sep 14, 2017

David Snyder
[egs,src,scripts]: Adding binaries for xvectors in nnet3bin. Creating a recipe in egs/sre16. Adding script for data augmentation
David Snyder
[egs,scripts]: Adding run.sh for baseline sre16 recipe, fixing indent in steps/data/augment_data_dir.py, adding script to extract xvectors
David Snyder
[scripts] fix to steps/libs/nnet3/xconfig/stats_layer.py so that it can support the xvector config. Also changing sid/nnet3/xvector/get_egs.sh so that it uses utt2num_frames instead of utt2len
David Snyder
[egs] adding egs/sre16/v2/run.sh, the top-level script for the xvector recipe. NOTE that this is still a work in progress
David Snyder
[egs,scripts,src] adding results to egs/sre16/v2. Adding gpu support to extract_xvectors. Cosmetic improvements to several xvector scripts, and nnet3-xvector-compute
David Snyder
[egs,src] cosmetic improvements to nnet3-xvector-get-egs, adding script prepare_feats_for_egs.sh which prepares features for xvector training

@david-ryan-snyder david-ryan-snyder changed the title from WIP Xvectors: DNN embeddings for Speaker Recognition to Xvectors: DNN embeddings for Speaker Recognition Sep 24, 2017

@david-ryan-snyder david-ryan-snyder changed the title from Xvectors: DNN embeddings for Speaker Recognition to Xvectors: DNN Embeddings for Speaker Recognition Sep 24, 2017

@osadj (Contributor) commented Sep 25, 2017

@david-ryan-snyder, which i-vector system is this? On SRE'16 evaluation set, we are getting 9.63% EER (minDCF~0.63) with a basic single i-vector system. It is promising to see you are reporting improvement with "xvectors" over your own i-vector system, but you also need to compare the result with that of a state-of-the-art system.

@danpovey (Contributor) commented Sep 25, 2017 (comment minimized; content hidden)

@osadj (Contributor) commented Sep 25, 2017

Of course. And no data augmentation was used. We only used previous SRE data (SRE04-SRE10) with a GMM-based system. Please see NIST's presentation at the SRE'16 workshop for more details.

@danpovey (Contributor) commented Sep 25, 2017 (comment minimized; content hidden)

@danpovey (Contributor) commented Sep 29, 2017 (comment minimized; content hidden)

@osadj (Contributor) commented Sep 29, 2017 (comment minimized; content hidden)

@entn-at (Contributor) commented Sep 29, 2017

@osadj For features used in calculating (GMM/DNN senone) posteriors or as features for calculating BW stats? Or are you talking about DNN embedding type systems such as this PR?

@osadj (Contributor) commented Sep 29, 2017 (comment minimized; content hidden)

@david-ryan-snyder (Contributor) commented Sep 29, 2017

@danpovey, FYI, I'm rerunning the v2 recipe with MFCCs (same dim as the filter banks). I think by the time we go through the PR, I'll be able to update the results (which I doubt will change much).

@danpovey

Mostly very small comments.

Review comments (now outdated) on:
  • egs/sre08/v1/sid/nnet3/xvector/allocate_egs.py
  • egs/sre08/v1/sid/nnet3/xvector/get_egs.sh
  • egs/sre08/v1/sid/nnet3/xvector/extract_xvectors.sh
  • egs/sre16/v1/local/make_musan.py
  • src/nnet3bin/nnet3-xvector-get-egs.cc
@david-ryan-snyder (Contributor) commented Sep 30, 2017

Thanks Dan! I'll look into addressing these.

@dgromero commented Oct 2, 2017

@danpovey, @david-ryan-snyder the reason why we are using Fbanks is to better support the current research we are doing in bandwidth extension/multi-bandwidth systems, as well as some frequency axis invariance (using convolutions). In the experiments, we saw that the results were very similar with MFCCs and Fbanks, but Fbanks are more in line with our future plans.

@dgromero commented Oct 2, 2017

@david-ryan-snyder, @danpovey I also wanted to mention that you are being too humble. xvectors are not only shining for short durations! The next round of ICASSP papers will show that. The main advantage of the xvectors (or in general, DNN embedding architectures) is that they are able to leverage large amounts of data much better than the ivector architectures. Specifically, data augmentation helps in the embedding training part (whereas in standard ivector systems it typically hurts, and it is the back-end that needs to handle the multi-condition/data-augmented training). The reason is that augmenting data to train the UBM and T matrices is not such a good idea, since it is an unsupervised task. However, the xvectors are trained in a supervised way with multiclass cross-entropy to discriminate among speakers. That makes the xvectors more invariant to those distortions. Also, you can still use all the same back-end bag-of-tricks that we have accumulated over the years for ivector systems.

@danpovey (Contributor) commented Oct 2, 2017 (comment minimized; content hidden)

David Snyder
[egs,src,scripts] Fixes to sre16 data prep scripts. More documentation in sre16 recipe and xvector code. Upgrading new python scripts to python3
@david-ryan-snyder (Contributor) commented Oct 2, 2017

Except for the fbanks vs MFCC issue, I believe I've addressed all the comments. I'm still waiting for the experiment using full-dim MFCCs to finish, and then this should be finished.

@entn-at (Contributor) commented Oct 3, 2017

I adapted the setup to SRE10, using the same corpora that are used in the sre10/v2 recipe (SRE04-08, SWB2 p2/3, SWB Cell p1/2, Fisher) and got an EER of 2% (no further tuning, just using the same data augmentation procedure, fbank features, LDA dimension, no PLDA model adaptation (trained on SRE04-08), etc.).
Would be interesting to run this on the PRISM set or similar to see how it compares under noise/reverberation (but I assume you've already done those kinds of experiments - I'm just posting to confirm that it runs fine ;-)).

@david-ryan-snyder (Contributor) commented Oct 3, 2017

Here are the updated results using MFCCs instead of Fbanks. The results are no worse. Maybe it's even slightly better. I'll go ahead and change it over in the recipe.

  # -- Pooled --
  #                             Fbanks     MFCC
  # EER:                        8.97       8.66
  # min_Cprimary:               0.61       0.61
  # act_Cprimary:               0.63       0.62
  #
  # -- Cantonese --
  # EER:                        4.71       4.69
  # min_Cprimary:               0.42       0.42
  # act_Cprimary:               0.43       0.43
  #
  # -- Tagalog --
  # EER:                       13.23      12.63
  # min_Cprimary:               0.77       0.76
  # act_Cprimary:               0.82       0.81

@osadj (Contributor) commented Oct 3, 2017 (comment minimized; content hidden)

@david-ryan-snyder (Contributor) commented Oct 3, 2017

@entn-at,

Very cool, thanks for running this!

I've done a similar experiment and got 1.7% EER on SRE10 (pooled, gender independent). I used different datasets (I included Voxceleb), though.

@david-ryan-snyder (Contributor) commented Oct 3, 2017

> SRE04-08, SWB2 p2/3, SWB Cell p1/2, Fisher

So you used Fisher in the xvector recipe? I've been wondering if this will work, since there aren't a lot of recordings per speaker. Maybe we could try an experiment where Fisher is augmented more aggressively than the other datasets (e.g., 3x or 4x). Might help to offset the limited data.

@osadj (Contributor) commented Oct 3, 2017 (comment minimized; content hidden)

@entn-at (Contributor) commented Oct 3, 2017

Omid: I ran it on C5 extended (same as the sre10/{v1,v2} recipes), pooled male/female (though it would be easy to get those numbers).

@osadj (Contributor) commented Oct 3, 2017 (comment minimized; content hidden)

@entn-at (Contributor) commented Oct 3, 2017

David: Yes, that is indeed the issue and the reason why most Fisher speakers get discarded during example preparation (I end up with 7170 speakers in the combined training set, 5767 of which are from SWB/SRE). More augmentation or treating Fisher differently when pruning the lists may make a big difference. Adding VoxCeleb is a good idea, I've used it in a different context for training in-domain PLDA/calibration. Another reason for the difference in EER I got could be that I reduced the number of training epochs from 3 to 2, as I only used 3 training jobs (num-jobs-initial/final=3).

Omid: Note that this was merely a proof-of-concept ("making it run"), I'm sure you can get better performance by tuning hyperparameters and adding more datasets. Without any tuning whatsoever, I get EER=2%, minDCF10=0.435
(for comparison, on the same set (SRE10, C5 extended, gender independent, pooled female/male) for the v1 recipe I got EER=2.26%, minDCF10=0.488, and for v2: EER=1.02%, minDCF10=0.197)

Edit: I included the run.sh as gist (https://gist.github.com/entn-at/486562edfa01f99a07a03d9c905a1a50). As you can see, it's very close to the original SRE16 script.

@david-ryan-snyder (Contributor) commented Oct 3, 2017

@osadj: I don't have the 1.7% EER on SRE10 using xvectors documented anywhere yet. In the future we'll probably work on a recipe for this (unless someone else finds a good one first, e.g., @entn-at :-)).

I think we can still do a lot better on SRE10 by using more augmentation on Fisher, additional datasets, etc. Maybe including ASR BNFs will help as well (although it's aesthetically undesirable, it's a fairer comparison with sre10/v2).

David Snyder
[egs] changed features from fbanks to mfccs in egs/sre16/v2, and updated the results (results are no worse). This was done because mfccs are more compressible on disk.
@david-ryan-snyder (Contributor) commented Oct 3, 2017

@danpovey, I've addressed those comments. I think the PR should be ready to ship now.

@danpovey danpovey merged commit e082c17 into kaldi-asr:master Oct 3, 2017

1 check passed: continuous-integration/travis-ci/pr (The Travis CI build passed)
@danpovey (Contributor) commented Oct 3, 2017

thanks! Merged.

@497354078 commented Jan 4, 2018

Hi, I have a question about the TDNN, [sre16/v2/local/nnet3/xvector/run_xvector.sh] line 105:

    # The stats pooling layer. Layers after this are segment-level.
    # In the config below, the first and last argument (0, and ${max_chunk_size})
    # means that we pool over an input segment starting at frame 0
    # and ending at frame ${max_chunk_size} or earlier. The other arguments (1:1)
    # mean that no subsampling is performed.
    stats-layer name=stats config=mean+stddev(0:1:1:${max_chunk_size})

This layer's output dim is 3001, but mean (1500) concatenated with stddev (1500) is 3000 dims. Can you explain the details? Thanks.

@david-ryan-snyder (Contributor) commented Jan 4, 2018
@497354078, in nnet3, this pooling layer is implemented as two components: StatisticsExtractionComponent and StatisticsPoolingComponent.

The first component computes sum(X), sum(X^2), and a count. That's why its output has 3001 rather than 3000 dimensions (1500 + 1500 + 1). This is received as input by the second component, which uses the sums and the count to compute avg(X) and stddev(X). The output of this second component is 3000-dimensional, as you would expect.

It might help to look at the forward computation for each component: https://github.com/kaldi-asr/kaldi/blob/master/src/nnet3/nnet-general-component.cc#L447 and https://github.com/kaldi-asr/kaldi/blob/master/src/nnet3/nnet-general-component.cc#L771

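To see where 3001 vs. 3000 comes from, here is a rough numpy sketch of the two stages. The exact layout of the extraction component's output (where the count sits relative to the sums) is illustrative; see the linked source for the real ordering.

```python
import numpy as np

T, D = 250, 1500   # frames in the segment, frame-level activation dim

x = np.random.default_rng(0).standard_normal((T, D))

# Stage 1 (cf. StatisticsExtractionComponent): per-segment sufficient
# statistics: a count, sum(x), and sum(x^2).
extraction_out = np.concatenate([[float(T)], x.sum(axis=0), (x ** 2).sum(axis=0)])
print(extraction_out.shape)   # (3001,): 1 + 1500 + 1500

# Stage 2 (cf. StatisticsPoolingComponent): turn those statistics into
# the mean and stddev actually passed to the segment-level layers.
count = extraction_out[0]
mean = extraction_out[1:D + 1] / count
var = extraction_out[D + 1:] / count - mean ** 2
pooled = np.concatenate([mean, np.sqrt(np.maximum(var, 0.0))])
print(pooled.shape)           # (3000,)
```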
