Using the GMM Specializer
Note: this specializer version is outdated. For the latest release please see: https://github.com/egonina/pycasp (PyCASP framework contains the GMM specializer among other things).
Here we describe how to use the GMM specializer in your Python code and give examples. Please refer to our HotPar'11 and ASRU'11 papers for details on the specializer and the speaker diarization example respectively. Our specializer uses numpy to store and manipulate arrays. Note that this code is still under development.
Please contact egonina at eecs dot berkeley dot edu with questions and comments.
Installing the specializer
Simply check the code out of the repo, and in the base directory run
$> python setup.py install --user or
$> sudo python setup.py install
However, we highly recommend using virtualenv and pip to install all packages in a private, user-level environment. Follow the directions here, and once you have a virtualenv running simply do
(my_env)$> pip install gmm_specializer
The package managers should fetch ASP and all its attendant dependencies and install all of them on your machine. If you have trouble with this step, consult these directions for manual installation of ASP. You can also get ASP pre-installed on a VM Image.
However, there are some external requirements that Pythonic package managers cannot take care of on your behalf, specifically the compilers required to actually build the specialized code.
If you want to use the CUDA backend for GPUs, you must install NVIDIA's compiler (nvcc), runtime, driver and at least one GPU card. The compiler must be on your $PATH, and the runtime libraries must be on your $LD_LIBRARY_PATH. We recommend a >3.0 release of the CUDA toolkit (especially 4.1), but the specializers should work with card compute capabilities as low as 1.2.
If you want to use the Cilk+ backend for Intel multicore processors, you must install Intel's compiler (icc), libraries, and the Cilk+ runtime. The compiler must be on your $PATH, and the runtime libraries must be on your $LD_LIBRARY_PATH. We recommend the 12.0.5 release of Cilk+.
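Before running the specializer, it can help to confirm that the compilers are actually visible from Python. The following is a minimal sketch using only the standard library (Python 3.3+ for shutil.which); the helper names check_backend_tools and library_path_entries are made up for this illustration and are not part of ASP or the specializer.

```python
import os
import shutil


def check_backend_tools(compilers=("nvcc", "icc")):
    """Map each compiler name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in compilers}


def library_path_entries():
    """Return the non-empty directories listed in LD_LIBRARY_PATH."""
    value = os.environ.get("LD_LIBRARY_PATH", "")
    return [entry for entry in value.split(os.pathsep) if entry]


for name, path in check_backend_tools().items():
    print("%s: %s" % (name, path or "not found on PATH"))
print("LD_LIBRARY_PATH entries: %s" % library_path_entries())
```

You only need the compiler for the backend you intend to use, so a missing icc is harmless on a CUDA-only machine and vice versa.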
Finally, all specializers built on ASP have a configuration file that contains some simple directives for each specializer. We provide an example configuration in
asp_config.yml. If you already have some ASP-based specializers installed, just append this file to the existing one. Otherwise, copy it to
~/.asp_config.yml. With these settings you can control whether the Cilk or CUDA backend will be the target of specialization, which CUDA device specialized code will be run on, and whether the specializer will attempt to auto-tune itself to your particular machine and problem space (experimental).
Once you think the python dependencies, compilers, environment variables and config file are set up correctly, try
Then take a look at the sample applications provided in
examples/ and read on.
Importing the specializer
After installing Asp and the GMM specializer, you need to import it in your Python script like so:
from gmm_specializer.gmm import *
Creating the GMM object
Creating a GMM object is just like creating an object of any class in Python. You can create an empty GMM object by specifying its dimensions (M = number of components, D = dimension) and whether it has a diagonal or full covariance matrix (cvtype='diag' for diagonal, cvtype='full' for full; diagonal by default):
gmm = GMM(M, D, cvtype='diag')
The parameters will be initialized randomly from the data when the
train() function is called (see below). GMM can also be initialized with existing parameters, like so:
gmm = GMM(M, D, cvtype='diag', means=my_means, covars=my_covar, weights=my_weights)
Here means, covars and weights are numpy arrays. Note: training overwrites these parameters with the newly trained values. If you are reusing parameters from a different GMM, make a copy of them first and pass the copy to the GMM constructor.
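The reason for copying is that numpy arrays are passed by reference, so two objects can end up sharing one buffer. The sketch below shows the effect with plain numpy; overwrite_in_place is a hypothetical stand-in for any routine that writes into the array it is given, not a function of the specializer.

```python
import numpy as np


def overwrite_in_place(params):
    """Hypothetical stand-in for training: overwrites the array it receives."""
    params[:] = 0.0


source_means = np.ones((2, 3))      # parameters owned by another GMM

shared = source_means               # same underlying buffer, no copy made
overwrite_in_place(shared)
print(source_means.sum())           # 0.0: the original was clobbered

source_means = np.ones((2, 3))
private = source_means.copy()       # independent buffer
overwrite_in_place(private)
print(source_means.sum())           # 6.0: the original is untouched
```

With the specializer, the same idea would look like passing other_gmm.components.means.copy() (and likewise for covars and weights) to the GMM constructor.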
To train the GMM object using the Expectation-Maximization (EM) algorithm on a set of observations, use the train() function:
lkld = gmm.train(data, min_em_iters=1, max_em_iters=3)
data is an N by D numpy array of observation vectors (N vectors, each of D dimensions) and min_em_iters and max_em_iters bound the number of EM iterations (both optional, default min = 1, max = 10). It returns the likelihood of the trained GMM fitting the data.
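To illustrate how a minimum and a maximum iteration count can bound an EM loop, here is a small sketch. It is one plausible reading, not the specializer's actual stopping rule: em_step is a hypothetical callable that runs one EM iteration and returns the current log-likelihood, and tol is an assumed convergence threshold.

```python
def run_em(em_step, min_em_iters=1, max_em_iters=10, tol=1e-4):
    """Call em_step() at least min_em_iters and at most max_em_iters times.

    Stops early once the log-likelihood changes by less than tol between
    two consecutive iterations, provided the minimum has been reached.
    """
    previous = None
    likelihood = None
    iterations = 0
    while iterations < max_em_iters:
        likelihood = em_step()
        iterations += 1
        converged = previous is not None and abs(likelihood - previous) < tol
        if iterations >= min_em_iters and converged:
            break
        previous = likelihood
    return likelihood, iterations


# Fake EM step whose log-likelihood stops improving after the second call.
fake_values = iter([-10.0, -5.0, -5.0, -5.0, -5.0, -5.0])
final_lkld, n_iters = run_em(lambda: next(fake_values), min_em_iters=1, max_em_iters=10)
print(final_lkld, n_iters)
```

Raising min_em_iters forces extra iterations even after the likelihood has flattened, while lowering max_em_iters caps the cost on data that converges slowly.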
Computing likelihood given the trained GMM
To compute the log-likelihood of the trained GMM on a new set of observations use the score() function:
log_lklds = gmm.score(data)
data is an N by D numpy array. The function returns a numpy array of N log-likelihoods, one for each observation vector. To get cumulative statistics about the data, you can use numpy.average() or numpy.sum().
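For intuition about what these N numbers are, the following numpy sketch computes per-observation log-likelihoods for a diagonal-covariance mixture and then aggregates them. It is a reference computation under stated assumptions, not the specializer's code: the function name is made up, and variances is assumed to be an M by D array of per-dimension variances rather than full covariance matrices.

```python
import numpy as np


def diag_gmm_log_likelihoods(data, weights, means, variances):
    """Return N log-likelihoods for data (N, D) under a diagonal GMM.

    weights: (M,), means: (M, D), variances: (M, D).
    """
    diff = data[:, None, :] - means[None, :, :]                       # (N, M, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (M,)
    log_comp = log_norm[None, :] - 0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2)
    weighted = log_comp + np.log(weights)[None, :]                    # (N, M)
    # Log-sum-exp over components, shifted by the row maximum for stability
    peak = weighted.max(axis=1, keepdims=True)
    return peak[:, 0] + np.log(np.exp(weighted - peak).sum(axis=1))


# One standard normal component in one dimension, two observations.
data = np.array([[0.0], [1.0]])
log_lklds = diag_gmm_log_likelihoods(
    data,
    weights=np.array([1.0]),
    means=np.array([[0.0]]),
    variances=np.array([[1.0]]),
)
print(np.average(log_lklds), np.sum(log_lklds))
```

The average is convenient for comparing datasets of different sizes, while the sum is the total log-likelihood of the whole set under the assumption that observations are independent.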
Other evaluation functions for trained GMMs
We emulate the functionality provided by sklearn.mixture.GMM by providing other functions to evaluate trained GMMs.
log_lklds, posteriors = gmm.eval(data) returns N log-likelihoods, and N by M posterior probabilities (a probability of each component explaining each event).
log_lklds, indexes = gmm.decode(data) returns N log-likelihoods, and N indexes indicating which component most probably explained each event.
indexes = gmm.predict(data) returns N indexes indicating which component most probably explained each event.
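The three kinds of output are closely related, which the following numpy sketch tries to make concrete. It assumes you already have an N by M array of weighted component log-probabilities (log weight plus log density); the function name summarize_components is invented for this illustration and is not part of the specializer.

```python
import numpy as np


def summarize_components(weighted_log_probs):
    """weighted_log_probs[n, m] = log(weight_m) + log p(x_n | component m)."""
    peak = weighted_log_probs.max(axis=1, keepdims=True)
    log_lklds = peak[:, 0] + np.log(np.exp(weighted_log_probs - peak).sum(axis=1))
    # Posterior of each component for each event; every row sums to 1
    posteriors = np.exp(weighted_log_probs - log_lklds[:, None])
    # Most probable component per event
    indexes = posteriors.argmax(axis=1)
    return log_lklds, posteriors, indexes


weighted = np.log(np.array([[0.1, 0.3],
                            [0.4, 0.1]]))
log_lklds, posteriors, indexes = summarize_components(weighted)
print(posteriors)
print(indexes)
```

In this reading, the log-likelihoods and posteriors correspond to what eval() returns, and the indexes correspond to the second output of decode() and the output of predict().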
Accessing the GMM parameters
You can access the GMM mean, covariance and weight parameters like so:
means = gmm.components.means
covariance = gmm.components.covars
weights = gmm.components.weights
means is an M by D array (number of components by number of dimensions),
covariance is an M by D by D array (number of components by number of dimensions by number of dimensions) and
weights is an array of size M (number of components).
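As a quick way to check your own arrays against these shapes, the sketch below builds placeholder parameters with numpy. The sizes M = 4 and D = 3 and the identity covariances are arbitrary choices for illustration.

```python
import numpy as np

M, D = 4, 3  # example sizes chosen for illustration

means = np.zeros((M, D))                  # one D-dimensional mean per component
covars = np.tile(np.eye(D), (M, 1, 1))    # one D-by-D identity matrix per component
weights = np.full(M, 1.0 / M)             # uniform mixture weights that sum to 1

print(means.shape, covars.shape, weights.shape)
```

Arrays shaped like these are what the means, covars and weights constructor arguments described earlier expect.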
Example: Simple Training and Evaluation
This is a simple example that takes a training dataset
training_data, creates a 32-component GMM and trains it on the data, and then computes the average log_likelihood of a testing dataset:
from gmm_specializer.gmm import *
import numpy as np

training_data = np.array(get_training_data())  # training_data.shape = (N1, D)
testing_data = np.array(get_testing_data())    # testing_data.shape = (N2, D)

M = 32
D = training_data.shape[1]                   # get the D dimension from the data
gmm = GMM(M, D, cvtype='diag')               # create a new GMM object with diagonal covariance
gmm.train(training_data, max_em_iters=5)     # train the GMM on the training data
log_lklds = gmm.score(testing_data)          # compute the log-likelihoods of the testing observations
print("Average log likelihood for testing data = %f" % np.average(log_lklds))
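The example leaves get_training_data() and get_testing_data() undefined, since they depend on your application. If you just want to try the code, the sketch below supplies hypothetical stand-ins that draw synthetic points around a few shared cluster centers; the sizes, the 13 dimensions and the seeds are arbitrary assumptions.

```python
import numpy as np


def make_synthetic_data(n_points, n_dims=13, n_clusters=3, seed=0):
    """Draw n_points vectors scattered around a fixed set of cluster centers."""
    # Fixed seed for the centers so training and testing share one distribution
    centers = np.random.RandomState(42).uniform(-5.0, 5.0, size=(n_clusters, n_dims))
    rng = np.random.RandomState(seed)
    labels = rng.randint(0, n_clusters, size=n_points)
    return centers[labels] + rng.normal(size=(n_points, n_dims))


def get_training_data():
    """Hypothetical stand-in: 1000 synthetic training vectors."""
    return make_synthetic_data(1000, seed=0)


def get_testing_data():
    """Hypothetical stand-in: 200 synthetic testing vectors."""
    return make_synthetic_data(200, seed=1)


training_data = np.array(get_training_data())
testing_data = np.array(get_testing_data())
print(training_data.shape, testing_data.shape)
```

Real feature data will of course behave differently, so treat any likelihood obtained on this synthetic set only as a smoke test.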
The examples/ directory includes two example applications: Speaker Diarization in
cluster.py and a Song Recommendation Engine
We have implemented a speaker diarization application using the GMM specializer. The task of the application is to determine "who spoke when?" in an audio recording. The algorithm is based on agglomerative hierarchical clustering of GMMs using the Bayesian Information Criterion (BIC) to segment the audio feature files into speaker-homogeneous regions. Here we briefly describe the implementation in Python using the GMM specializer. For more details on the application, please see our ASRU'11 paper.
The script for diarization is in
examples/cluster.py. After reading the config file (see below), the
__main__ function creates a
Diarizer object, which then creates an initial list of GMMs used for clustering. It then calls the
cluster() function to perform the main clustering computation. The algorithm is outlined as follows:
- Initialization: Train a set of GMMs, one per initial segment, using the expectation-maximization (EM) algorithm.
- Re-segmentation: Re-segment the audio track using majority vote over the GMMs’ likelihoods for 2.5s duration.
- Re-training: Retrain the GMMs on the new segmentation.
- Agglomeration: Select the most similar GMMs and merge them. At each iteration, the algorithm checks all possible pairs of GMMs, looking to obtain an improvement in BIC scores by merging the pair and retraining it on the pair’s combined audio segments. The GMM clusters of the pair with the largest improvement in BIC scores are permanently merged. The algorithm then repeats from the re-segmentation step until there are no remaining pairs whose merging would lead to an improved BIC score.
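The agglomeration step can be sketched as a greedy loop. This is a simplified illustration, not the code in cluster.py: merge_gain and merge are hypothetical callables standing in for the BIC comparison and the GMM merge, and the re-segmentation and retraining that happen between merges are only marked by a comment.

```python
from itertools import combinations


def agglomerate(clusters, merge_gain, merge):
    """Greedily merge the best-scoring pair until no merge improves the score.

    merge_gain(a, b) -> score improvement if a and b were merged (hypothetical)
    merge(a, b)      -> the merged cluster (hypothetical)
    """
    clusters = list(clusters)
    while len(clusters) > 1:
        best_gain, best_pair = 0.0, None
        for i, j in combinations(range(len(clusters)), 2):
            gain = merge_gain(clusters[i], clusters[j])
            if gain > best_gain:
                best_gain, best_pair = gain, (i, j)
        if best_pair is None:
            break  # no remaining pair improves the score
        i, j = best_pair
        merged = merge(clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        # The full algorithm would re-segment and retrain here before the next pass
    return clusters


def mean(values):
    return sum(values) / len(values)


# Toy clusters of 1-D points; the gain is positive only when the means are close.
toy = [(0.0, 0.2), (0.1,), (5.0,), (5.3,)]
result = agglomerate(
    toy,
    merge_gain=lambda a, b: 1.0 - abs(mean(a) - mean(b)),
    merge=lambda a, b: a + b,
)
print(sorted(result))
```

Checking every pair costs a number of evaluations quadratic in the number of clusters on each pass, which is the cost that a cheaper pre-selection of candidate pairs aims to reduce.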
The script has the ability to choose between using the KL-divergence-based approximation for choosing the GMM pairs to merge, or comparing all pairs of GMMs (see paper). This setting can be specified in the config file (see below).
Finally, the script outputs two types of files, the segmentation result (in the NIST RTTM format) and the final parameters of the trained GMMs.
To run the script, use a regular Python script invocation:
Using the diarization config file
The script takes in a config file to assist in setting all the parameters for diarization. The default config file name that the script looks for is
diarizer.cfg. You can also pass it your own config file by using the -c option:
python examples/cluster.py -c my_config.cfg. We use the Python ConfigParser library, so the script requires the parameters in the config file to go under the
[Diarizer] section tag. To display the config file settings, you can use the
--help option when running the script:
python examples/cluster.py --help.
Here's an example
diarizer.cfg file on a sample AMI meeting:
[Diarizer]
basename = IS1000a
mfcc_feats = /AMI/featuresIS1000a_seg.feat.htk
spnsp_file = /AMI/spnsp/IS1000a_seg.spch
output_cluster = IS1000a.rttm
gmm_output = IS1000a.gmm
em_iterations = 3
initial_clusters = 16
M_mfcc = 5
KL_ntop = 3
num_seg_iters_init = 1
num_seg_iters = 1
seg_length = 250
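If you want to inspect or validate such a file from your own code, a minimal sketch with the standard library looks like this. It uses the Python 3 module name configparser (the module is called ConfigParser on Python 2); load_diarizer_settings is a made-up helper, not the parsing code in cluster.py, and the fallback values are the defaults stated in this document.

```python
import configparser  # named ConfigParser on Python 2

SAMPLE_CONFIG = """
[Diarizer]
basename = IS1000a
initial_clusters = 16
M_mfcc = 5
KL_ntop = 3
"""


def load_diarizer_settings(text):
    """Parse the [Diarizer] section and apply defaults for optional values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["Diarizer"]
    return {
        "basename": section["basename"],
        "initial_clusters": section.getint("initial_clusters"),
        "M_mfcc": section.getint("M_mfcc"),
        "KL_ntop": section.getint("KL_ntop", fallback=0),
        "em_iterations": section.getint("em_iterations", fallback=3),
        "seg_length": section.getint("seg_length", fallback=250),
    }


settings = load_diarizer_settings(SAMPLE_CONFIG)
print(settings)
```

To read a file on disk instead of a string, parser.read(path) takes the place of parser.read_string(text).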
Some of the parameters are required and some are optional (and have some default values):
- basename: meeting base name
- mfcc_feats: HTK feature file for the audio recording
- output_cluster: name of the output RTTM file
- gmm_output: name of the GMMs parameters file
- initial_clusters: number of initial clusters
- M_mfcc: number of gaussians per model
- em_iterations: number of EM iterations for training (3 by default)
- spnsp_file: Speech/nonspeech file
- KL_ntop: number of GMM pairs to evaluate BIC on (0 to deactivate KL-divergence)
- num_seg_iters_init: number of majority vote segmentation iterations for the initial phase (2 by default)
- num_seg_iters: number of majority vote segmentation iterations for the main clustering loop (3 by default)
- seg_length: segment length for majority vote (250 by default)
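To show what seg_length controls, here is a simplified sketch of majority-vote smoothing. It assumes each frame has already been assigned the index of its best-scoring GMM and that blocks do not overlap; the actual script may vote differently (for example on accumulated likelihoods), and the function name is invented for this illustration.

```python
from collections import Counter


def majority_vote(frame_labels, seg_length=250):
    """Replace each block of seg_length frame labels with its most common label."""
    smoothed = []
    for start in range(0, len(frame_labels), seg_length):
        block = frame_labels[start:start + seg_length]
        winner = Counter(block).most_common(1)[0][0]
        smoothed.extend([winner] * len(block))
    return smoothed


labels = [0, 0, 1, 0, 2, 2, 1, 2, 1]
print(majority_vote(labels, seg_length=4))  # [0, 0, 0, 0, 2, 2, 2, 2, 1]
```

With 10 ms frames (an assumption about the feature files), the default of 250 frames would correspond to the 2.5 s window mentioned in the algorithm outline.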
Song Recommendation Engine
We have implemented a simple song recommendation engine using the Million Song Dataset (MSD). The idea is, given a tag (for example a genre like "metal" or "jazz", or a mood like "sad" or "romantic"), to find all songs in the Dataset that match that tag. Then, using a GMM-UBM approach, we select the top 20 most similar songs to recommend to the listener, both from the labeled set of songs and from the unlabeled set (i.e. the songs in the Dataset that do not contain the tag).
The algorithm outline is as follows:
- Set the category_tag variable to the tag we want to use for recommendation.
- Given all songs that match the tag (labeled examples), split the set into 70% songs for training set and 30% for testing set. Current tags are selected from "artist_terms" of the song, with "frequency" > 0.8 (see the MSD description).
- Collect all songs from the Dataset that don't contain the tag (we call these unlabeled examples).
- Train a (32 component) GMM on the timbre features of the songs in the training set.
- Collect features from all the songs in the Dataset for the UBM (Universal Background Model).
- Train a (32 component) GMM for the UBM on 30% of all song features.
- Compute the log likelihood for the songs in the testing set (labeled examples) for both the tag-GMM and the UBM.
- Compute the log likelihood for the unlabeled example songs for both the tag-GMM and the UBM.
- Display the top 20 recommended songs from the labeled example set.
- Display the top 20 recommended songs from the unlabeled example set.
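Two of the steps above, the 70/30 split and the final ranking, can be sketched without any GMM code. The ranking below uses the difference between the tag-GMM and UBM log-likelihoods, which is a common GMM-UBM scoring rule and an assumption here rather than a confirmed detail of the example script; both function names are made up.

```python
import random


def split_train_test(songs, train_fraction=0.7, seed=0):
    """Shuffle the songs and split them into training and testing lists."""
    shuffled = list(songs)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def top_recommendations(tag_scores, ubm_scores, top_n=20):
    """Rank songs by log-likelihood ratio: tag-GMM score minus UBM score."""
    ratios = {song: tag_scores[song] - ubm_scores[song] for song in tag_scores}
    return sorted(ratios, key=ratios.get, reverse=True)[:top_n]


train, test = split_train_test(["song_%d" % i for i in range(10)])
print(len(train), len(test))  # 7 3

# Hypothetical per-song average log-likelihoods under each model
tag_scores = {"a": -10.0, "b": -12.0, "c": -9.0}
ubm_scores = {"a": -11.0, "b": -11.0, "c": -12.0}
print(top_recommendations(tag_scores, ubm_scores, top_n=2))  # ['c', 'a']
```

Subtracting the UBM score favors songs that fit the tag model better than they fit music in general, rather than songs that simply score high under every model.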
The example script is
__main__ function performs all the computation and prints the recommendations to the screen. We use Python pickle objects to store the features and the song dictionary for faster access. The script assumes the MSD is loaded locally on the machine and the
get_song_dict() function takes the root directory of the MSD data. Please see the script for further details.
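The pickle-based caching mentioned above follows a simple load-or-build pattern, sketched here with the standard library. The helper load_or_build, the file name and the tiny dictionary are illustrative assumptions; the example script's own caching code may be organized differently.

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build):
    """Return the cached object if the pickle exists, else build and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    value = build()
    with open(cache_path, "wb") as handle:
        pickle.dump(value, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return value


calls = []


def build_song_dict():
    """Hypothetical expensive step, e.g. walking the dataset directory."""
    calls.append(1)
    return {"song_1": [0.1, 0.2]}


cache_file = os.path.join(tempfile.mkdtemp(), "song_dict.pkl")
first = load_or_build(cache_file, build_song_dict)
second = load_or_build(cache_file, build_song_dict)
print(first == second, len(calls))  # True 1
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.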
To run the script, use a regular Python script invocation: