How To Use

iondel edited this page Aug 25, 2016 · 27 revisions


Standard Phone-Loop Model

The standard model used by AMDTK for discovering acoustic units is a Bayesian phone-loop model. We abuse the terminology and refer to a phone as a "unit", though there is no guarantee that the model will learn an exact mapping from units to phones. Each unit is represented by a left-to-right HMM and is embedded in a loop structure. The phone-loop model represents a truncated Dirichlet Process (DP) whose atoms are the priors over the HMMs representing the units. Each Gaussian component has a Normal-Gamma prior, and the mixture weights have a Dirichlet prior. The weights of the mixture of units, sampled from the DP, can be seen as a unigram language model over the units. Henceforth, we refer to this standard unit-loop model as the unigram AUD model, where AUD stands for Acoustic Unit Discovery. For more details about this model, please see: Variational Inference for Acoustic Unit Discovery. Please note that, unlike in the paper, the model has no prior over the unit-HMM transition probabilities, as we found experimentally that learning the transitions is time-consuming and does not improve the accuracy of the model.
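To make the role of the truncated DP more concrete, the following minimal sketch draws unit weights with the stick-breaking construction. It assumes the standard stick-breaking representation with Beta(1, concentration) draws; the function name and the use of Python's random module are illustrative assumptions and are not taken from AMDTK.

```python
import random

def truncated_stick_breaking(concentration, truncation, seed=0):
    """Sample mixture weights from a truncated stick-breaking process.

    Illustrative helper, not an AMDTK function.
    """
    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to a unit
    for _ in range(truncation - 1):
        # Break off a Beta(1, concentration) fraction of the remaining stick.
        fraction = rng.betavariate(1.0, concentration)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    # The last unit takes whatever is left so that the weights sum to one.
    weights.append(remaining)
    return weights

weights = truncated_stick_breaking(concentration=1.0, truncation=10)
print(len(weights), sum(weights))
```

The resulting weights play the role of the unigram language model over the units: a small concentration puts most of the mass on the first few units, while a large one spreads it over more units.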

Creating the Model

To create a unigram AUD model run:

utils/ keys_file output_directory 

Internally, this script proceeds in two steps. First, it estimates the mean m and variance v of the set of features defined by keys_file. Then, it creates a phone-loop model where the posterior mean of each Gaussian is initialized by randomly sampling a point from a Gaussian with mean m and variance v. The other parameters of the posterior distributions are initialized to the same values as their respective priors. The prior values should be defined in the file:



  • sil_ngauss is the number of Gaussians for the silence model. If this parameter is greater than 0, the model assumes that each utterance starts and ends in the silence model. The silence model is a left-to-right HMM with the same number of states as the other units, but all its states share the same GMM. When set to 0, the model does not include a silence model and an utterance can start and end in any unit.
  • concentration is the concentration parameter of the DP. A large concentration allows the model to use more units, whereas a small value constrains the model to use only a small set of units.
  • truncation is the number of units used to approximate the infinite mixture defined by the DP. Informally, this parameter defines the maximum number of units discovered during training.
  • nstates is the number of states in each unit-HMM. When set to 1, each unit-HMM reduces to a simple GMM.
  • ncomponents is the number of Gaussian components per state.
  • alphas is the hyper-parameter of the symmetric Dirichlet prior over the GMM weights of each state.
  • kappa is the scaling coefficient of the precision in the Normal-Gamma prior.
  • a is the shape parameter of the Gamma distributions.
  • b (times the variance v) is the rate parameter of the Gamma distributions.
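
The two-step initialization described above can be sketched as follows. This is a minimal illustration that assumes per-dimension (diagonal) variances and plain Python lists as features; the function names are hypothetical and do not correspond to AMDTK's actual code.

```python
import random

def mean_and_variance(features):
    """Per-dimension mean and variance of a list of feature vectors."""
    n = len(features)
    dim = len(features[0])
    m = [sum(x[d] for x in features) / n for d in range(dim)]
    v = [sum((x[d] - m[d]) ** 2 for x in features) / n for d in range(dim)]
    return m, v

def sample_initial_means(m, v, n_gaussians, seed=0):
    """Draw one initial posterior mean per Gaussian from N(m, diag(v))."""
    rng = random.Random(seed)
    return [
        [rng.gauss(m[d], v[d] ** 0.5) for d in range(len(m))]
        for _ in range(n_gaussians)
    ]

# Toy 2-dimensional features standing in for the data listed in keys_file.
features = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
m, v = mean_and_variance(features)
means = sample_initial_means(m, v, n_gaussians=4)
print(m, v, means)
```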


The phone-loop model can be trained with the Variational EM algorithm by running:

utils/ parallel_opts niter model_in_dir model_out_dir 

This will run the Variational EM algorithm for niter iterations. Optionally, acoustic scaling can be applied during training. The acoustic scale for the training of the unigram model is defined as unigram_ac_weight in your file. A value of 1 corresponds to the standard VB algorithm.

Generating Lattices

The phone-loop model can be used to create lattices by running:

utils/ parallel_opts keys_file model_dir output_dir

The lattices are created by first generating the posteriors of the GMMs of the phone-loop model and then passing these posteriors to HVite. Note that since we do not give any language model to HVite, the best path from the HTK lattices may differ from the best path computed with AMDTK (which uses either a unigram or a bigram language model over the units). Lattice generation parameters should be defined in the file:



  • beam-thresh is the pruning threshold
  • penalty is the inter-model transition penalty (in the log domain)
  • gscale is the grammar scale factor
  • conf_latt_dir is the directory where the HTK lattice configuration file should be stored

Generating Posteriors

The per-frame unit posteriors can be generated by running:

utils/ parallel_opts keys_file model_in_dir output_dir

The posteriors are computed by running the forward-backward algorithm on the phone-loop model and are stored in HTK binary format. Optionally, one can output the per-state posteriors by using the --hmm_states option:

utils/ --hmm_states parallel_opts keys_file model_in_dir output_dir

When using the --hmm_states option, the output posterior dimension will be the total number of units times the number of HMM states per unit. For example, if we have a model with the truncation set to 100 and 3 states per unit-HMM, the posterior dimension will be 100 x 3 = 300.
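
The dimension arithmetic can be checked with a short sketch. The unit-major ordering of the states used here is an assumption made for illustration, since the actual layout of the output dimensions is not specified on this page; all function names are hypothetical.

```python
def posterior_dim(n_units, nstates, hmm_states=False):
    """Dimension of the posterior vectors with or without --hmm_states."""
    return n_units * nstates if hmm_states else n_units

def state_index(unit, state, nstates):
    """Flat index of (unit, state), assuming unit-major ordering."""
    return unit * nstates + state

def unit_posteriors(state_post, nstates):
    """Sum per-state posteriors into per-unit posteriors (same ordering)."""
    return [sum(state_post[i:i + nstates])
            for i in range(0, len(state_post), nstates)]

print(posterior_dim(100, 3, hmm_states=True))  # 300
```

Under this assumed ordering, summing each consecutive block of nstates values recovers the per-unit posteriors from the per-state ones.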

Naturally, the posteriors tend to be very sharp, assigning all the probability mass to one or two dimensions. To prevent this behavior, one can decrease the acoustic scaling in the setup file to a number smaller than one:


In a nutshell, when post_ac_weight is greater than or equal to 1, the posteriors will be very sharp whereas when it is set to values between 0 and 1 the posteriors will be smoother.
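
The effect of the acoustic scale can be illustrated on a single frame. The sketch below assumes that the scale multiplies the per-unit log-likelihoods before normalization, which is the usual meaning of acoustic scaling; it ignores the HMM structure, and the numbers and function name are made up for illustration.

```python
import math

def scaled_posteriors(log_likelihoods, ac_weight):
    """Normalize scaled log-likelihoods into posteriors (softmax)."""
    scaled = [ac_weight * ll for ll in log_likelihoods]
    max_ll = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - max_ll) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

frame_llh = [-40.0, -45.0, -52.0]  # hypothetical per-unit log-likelihoods
sharp = scaled_posteriors(frame_llh, ac_weight=1.0)
smooth = scaled_posteriors(frame_llh, ac_weight=0.1)
print(max(sharp), max(smooth))
```

With a weight of 1 nearly all the mass goes to the first unit, whereas a weight of 0.1 leaves a noticeable share for the other two.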

Labeling Data

Once a phone-loop model has been trained, it can be used to label new data by running:

utils/ parallel_opts keys_file model_in_dir output_dir

For each utterance specified in keys_file, we compute the best path using the Viterbi algorithm. The output is stored in output_dir/key.lab in HTK label file format.
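
For readers unfamiliar with the Viterbi algorithm, here is a generic sketch on a toy two-state HMM. It is not AMDTK's implementation: the function signature, the toy probabilities, and the use of plain lists of log-probabilities are assumptions made for illustration.

```python
import math

def viterbi(log_init, log_trans, log_obs):
    """Return the most likely state sequence.

    log_init[s]: log initial probability of state s
    log_trans[r][s]: log transition probability from state r to state s
    log_obs[t][s]: log-likelihood of frame t in state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_obs[0][s] for s in range(n_states)]
    backpointers = []
    for t in range(1, len(log_obs)):
        new_score, pointers = [], []
        for s in range(n_states):
            # Best predecessor of state s at frame t.
            best_prev = max(range(n_states),
                            key=lambda r: score[r] + log_trans[r][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_obs[t][s])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

log = math.log
log_init = [log(0.5), log(0.5)]
log_trans = [[log(0.9), log(0.1)], [log(0.1), log(0.9)]]
# Two frames that favor state 0 followed by two frames that favor state 1.
log_obs = [[log(0.9), log(0.1)]] * 2 + [[log(0.1), log(0.9)]] * 2
best_path = viterbi(log_init, log_trans, log_obs)
print(best_path)
```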

Viterbi Training

When training the phone-loop model with the Variational EM algorithm, the statistics for updating the model parameters are computed over all possible paths. However, in certain situations we cannot afford to compute the probability of all possible sequences of units. A possible approximation is to re-estimate the model parameters on the presumed best sequence of units for each utterance. This can be done by first generating the best-path sequence for a set of utterances. Each sequence should be stored in HTK label file format. The algorithm does not make use of the timing information, so it is not necessary to provide it. Then the model can be retrained by running:

utils/ parallel_opts keys_file model_in_dir labels_dir output_dir

For each entry key of the keys_file, we expect the corresponding file labels_dir/key.lab to exist. Other arguments are defined as usual.
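
Since the timing information is not used, label files with and without start and end times carry the same unit sequence. The sketch below shows one way to read such a sequence. It assumes the common HTK label layout of optional integer start and end times followed by the label name, and it assumes that unit names are not purely numeric; the function name and the unit names a4 and a17 are hypothetical.

```python
def read_unit_sequence(lines):
    """Extract the label names from HTK label lines, ignoring timing."""
    units = []
    for line in lines:
        fields = line.split()
        # Drop up to two leading integer fields (start and end times).
        n_times = 0
        while n_times < min(2, len(fields)) and fields[n_times].isdigit():
            n_times += 1
        if n_times < len(fields):
            units.append(fields[n_times])
    return units

with_times = ["0 300000 a4", "300000 900000 a17", "900000 1200000 a4"]
without_times = ["a4", "a17", "a4"]
print(read_unit_sequence(with_times))
print(read_unit_sequence(without_times))
```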
