# Recipe: Segmentation and Clustering of Buckeye English and NCHLT Xitsonga

## Contributors

## Overview
This is a recipe for unsupervised segmentation and clustering of subsets of the Buckeye English and NCHLT Xitsonga corpora. Details of the approach are given in Kamper et al., 2016:
- H. Kamper, A. Jansen, and S. J. Goldwater, "A segmental framework for fully-unsupervised large-vocabulary speech recognition," arXiv preprint arXiv:1606.06950, 2016.
Please cite this paper if you use this code.
The recipe below makes use of the separate [segmentalist](https://github.com/kamperh/segmentalist) package, which performs the actual unsupervised segmentation and clustering; it was developed together with this recipe.
## Disclaimer
The code provided here is not pretty. But I believe that research should be reproducible, and I hope that this repository is sufficient to make this possible for the paper mentioned above. I provide no guarantees with the code, but please let me know if you have any problems, find bugs or have general comments.
## Datasets

Portions of the Buckeye English and NCHLT Xitsonga corpora are used. The complete Buckeye corpus is required to execute the steps here, as well as the Xitsonga portion of the NCHLT data. These can be downloaded from:

- Buckeye corpus: [buckeyecorpus.osu.edu](http://buckeyecorpus.osu.edu)
- NCHLT Xitsonga portion: [www.zerospeech.com](http://www.lscp.net/persons/dupoux/bootphon/zerospeech2014/website/page_4.html). This requires registration for the challenge.
From the complete Buckeye corpus we split off several subsets. The most important are the sets labelled `devpart1` and `zs` in the code here. These sets respectively correspond to English1 and English2 in Kamper et al., 2016, so see the paper for more details. Details of which speakers are found in which set are also given at the end of [features/readme.md](features/readme.md). We use the entire Xitsonga dataset provided as part of the Zero Speech Challenge 2015 (this was already a subset of the NCHLT data).
## Preliminaries

Obtain all the datasets as described in the Datasets section above. Install all the standalone dependencies (see the Dependencies section below). Then clone the required GitHub repositories into `../src/` as follows:
```
mkdir ../src/
git clone https://github.com/kamperh/segmentalist.git ../src/segmentalist/
git clone https://github.com/kamperh/speech_correspondence.git \
    ../src/speech_correspondence/
git clone https://github.com/kamperh/speech_dtw.git ../src/speech_dtw/
git clone https://github.com/bootphon/tde.git ../src/tde
```
For both `segmentalist` and `speech_dtw`, you need to run `make` to build. Unit tests can be performed by running `make test`. See the readmes for more details.
The `speech_correspondence` and `speech_dtw` repositories are only necessary if you plan to do correspondence autoencoder (cAE) feature extraction. The `speech_correspondence` repository relies on the Theano and Pylearn2 dependencies, which are unnecessary if cAE features will not be used. The `tde` repository is only necessary if you plan to also calculate the evaluation metrics from the Zero Resource Speech Challenge 2015; without `tde` you will not be able to calculate the metrics in Section 4.5 of Kamper et al., 2016, but you will still be able to calculate the other metrics in the paper.
The `tde` package itself needs to be set up. In `../src/tde/`, run the following:

```
python setup.py build_ext --inplace
python setup_freeze.py build_exe
python move_build.py english english_dir
python move_build.py xitsonga xitsonga_dir
```
## Feature extraction

Some preprocessed resources are given in `features/data/`. Extract MFCC features by running the steps in [features/readme.md](features/readme.md). Some steps are optional depending on whether you intend to train a cAE (see below).
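The recipe itself extracts MFCCs with HTK (see Dependencies), following the steps in features/readme.md. Purely for orientation, comparable 39-dimensional MFCC+delta+acceleration features could be computed in Python with librosa; note that librosa is not a dependency of this recipe, the filename below is hypothetical, and the exact HTK settings will differ:

```python
import librosa
import numpy as np

# Illustrative only: the recipe itself uses HTK, whose configuration
# (window, frame shift, filterbank, normalisation) differs from these
# librosa defaults. "utterance.wav" is a hypothetical file.
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, n_frames)
d1 = librosa.feature.delta(mfcc)                     # deltas
d2 = librosa.feature.delta(mfcc, order=2)            # accelerations
features = np.vstack([mfcc, d1, d2]).T               # (n_frames, 39)
```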
## Correspondence autoencoder features (optional)
In Kamper et al., 2016 we compare both MFCCs and correspondence autoencoder (cAE) features as input to our system. It is not necessary to perform the steps below if you are happy with using MFCCs. The cAE was first introduced in this paper:
- H. Kamper, M. Elsner, A. Jansen, and S. J. Goldwater, "Unsupervised neural network based feature extraction using weak top-down constraints," in Proc. ICASSP, 2015.
The cAE is trained on word pairs discovered using an unsupervised term discovery (UTD) system (based on the code available here). This UTD system does not form part of the repository here. Instead, the output word pairs discovered by the UTD system are provided as part of the repository in the following files:
- English pairs: `features/data/buckeye.fdlps.0.93.pairs`
- Xitsonga pairs: `features/data/zs_tsonga.fdlps.0.925.pairs.v0`
The MFCC features for these pairs were extracted as part of feature extraction (previous section).
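For orientation, a minimal reader for these pair files might look like the sketch below. The field layout assumed here (one pair per line, each word token given as an utterance ID with start and end frames) is a guess for illustration only; inspect the files themselves for the actual format.

```python
from collections import namedtuple

Token = namedtuple("Token", ["utt", "start", "end"])

def read_pairs(path):
    # Assumed layout: "utt1 start1 end1 utt2 start2 end2" per line.
    # Verify against the actual pair files before relying on this.
    pairs = []
    with open(path) as f:
        for line in f:
            u1, s1, e1, u2, s2, e2 = line.split()
            pairs.append((Token(u1, int(s1), int(e1)),
                          Token(u2, int(s2), int(e2))))
    return pairs
```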
To train the cAE, run the steps in [cae/readme.md](cae/readme.md).
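Conceptually, the cAE takes a frame from one word in a discovered pair as input and is trained to reconstruct the DTW-aligned frame from the other word; the hidden representation then serves as the feature. The actual training uses a deep Theano/Pylearn2 network as per cae/readme.md; the toy single-hidden-layer NumPy sketch below, with illustrative dimensions and learning rate, only shows the training signal:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 39, 100  # illustrative: 39-dim MFCCs in, 100-dim code
params = {
    "W1": rng.normal(0.0, 0.1, (d_in, d_hid)),
    "W2": rng.normal(0.0, 0.1, (d_hid, d_in)),
}

def cae_step(params, x, y, lr=0.01):
    """One SGD step: encode frame x, reconstruct its DTW-aligned frame y."""
    h = np.tanh(x @ params["W1"])        # hidden code = the cAE feature
    err = h @ params["W2"] - y           # reconstruction error
    grad_W2 = np.outer(h, err)
    grad_W1 = np.outer(x, (err @ params["W2"].T) * (1.0 - h ** 2))
    params["W2"] -= lr * grad_W2
    params["W1"] -= lr * grad_W1
    return float(np.mean(err ** 2))

# Stand-in "aligned frames"; real training iterates over all aligned
# frame pairs from the UTD output.
x, y = rng.normal(size=d_in), rng.normal(size=d_in)
for _ in range(10):
    mse = cae_step(params, x, y)
```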
## Unsupervised syllable boundary detection
We use the unsupervised syllable boundary detection algorithm described in:
- O. J. Räsänen, G. Doyle, and M. C. Frank, "Unsupervised word discovery from speech using automatic segmentation into syllable-like units," in Proc. Interspeech, 2015.
Rather than packaging their code within our repository, we provide the output of their tools directly in `syllables/landmarks/`. All that remains is to extract subsets of Buckeye; run the following:

```
cd syllables
./get_landmarks_subset.py devpart1
./get_landmarks_subset.py zs
```
## Downsampling: acoustic word embeddings

We use one of the simplest methods to obtain acoustic word embeddings: downsampling. We downsample both MFCC features and cAE features. Run the steps in [downsample/readme.md](downsample/readme.md).
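Concretely, downsampling maps a variable-length feature sequence to a fixed-dimensional vector by keeping a fixed number of frames spread uniformly over the segment and flattening them. A minimal sketch follows; the choice of 10 frames here is illustrative, and the settings actually used are in downsample/readme.md:

```python
import numpy as np

def downsample_embed(features, n=10):
    """Embed a (T, d) feature sequence as a fixed n*d-dimensional vector
    by uniformly sampling n frames and flattening."""
    T, _ = features.shape
    idx = np.linspace(0, T - 1, n).round().astype(int)
    return features[idx].flatten()

# E.g. a 57-frame, 13-dim MFCC segment becomes one 130-dim embedding:
embedding = downsample_embed(np.random.randn(57, 13))  # shape (130,)
```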
## Segmentalist: Unsupervised segmentation and clustering

Segmentation and clustering are performed using the [segmentalist](https://github.com/kamperh/segmentalist) package. Run the steps in [segmentation/readme.md](segmentation/readme.md).
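Note that segmentalist performs joint segmentation and clustering by sampling under a Bayesian model (see Kamper et al., 2016), so the sketch below is not its actual algorithm. It is only a toy dynamic program illustrating the landmark-constrained search: candidate word boundaries are restricted to the syllable landmarks above, each candidate segment receives a score (hypothetically, how well its downsampled embedding fits a cluster), and the best boundary sequence is selected:

```python
import numpy as np

def best_segmentation(landmarks, score):
    """Toy dynamic program over candidate boundaries. `landmarks` is a
    sorted list of frame indices (including 0 and the utterance end);
    `score(a, b)` is a hypothetical per-segment score, e.g. the fit of
    the segment's downsampled embedding to its closest cluster."""
    n = len(landmarks)
    best = [-np.inf] * n
    back = [0] * n
    best[0] = 0.0
    for j in range(1, n):
        for i in range(j):
            s = best[i] + score(landmarks[i], landmarks[j])
            if s > best[j]:
                best[j], back[j] = s, i
    segments, j = [], n - 1
    while j > 0:                      # trace back the chosen boundaries
        segments.append((landmarks[back[j]], landmarks[j]))
        j = back[j]
    return segments[::-1]

# Usage with a stand-in score preferring segments about 30 frames long:
print(best_segmentation([0, 12, 35, 51, 80], lambda a, b: -abs((b - a) - 30)))
```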
## Dependencies

Standalone packages:

- Python
- Cython: Used by the `segmentalist` and `speech_dtw` repositories below.
- NumPy and SciPy.
- HTK: Used for MFCC feature extraction.
- Theano: Required by the `speech_correspondence` repository below.
- Pylearn2: Required by the `speech_correspondence` repository below.
Repositories from GitHub:

- [segmentalist](https://github.com/kamperh/segmentalist): This is the main segmentation software developed as part of this project. Should be cloned into the directory `../src/segmentalist/`, as done in the Preliminaries section above.
- [speech_correspondence](https://github.com/kamperh/speech_correspondence): Used for correspondence autoencoder feature extraction. Should be cloned into the directory `../src/speech_correspondence/`, as done in the Preliminaries section above.
- [speech_dtw](https://github.com/kamperh/speech_dtw): Used for correspondence autoencoder feature extraction. Should be cloned into the directory `../src/speech_dtw/`, as done in the Preliminaries section above.
- [tde](https://github.com/bootphon/tde): The Zero Resource Speech Challenge evaluation tools. Should be cloned into the directory `../src/tde/`, as done in the Preliminaries section above.