Embedded Segmental K-Means Applied to Buckeye English and NCHLT Xitsonga

License: MIT


Unsupervised acoustic word segmentation and clustering of Buckeye English and NCHLT Xitsonga data using the embedded segmental K-means (ES-KMeans) algorithm. The experiments are described in:

H. Kamper, K. Livescu, and S. J. Goldwater, "An embedded segmental K-means model for unsupervised segmentation and clustering of speech," in Proc. ASRU, 2017. [arXiv]

Please cite this paper if you use the code.

This recipe relies on the separate ES-KMeans package, which performs the actual unsupervised segmentation and clustering.

Download datasets

The Buckeye English corpus and portions of the NCHLT Xitsonga corpus are used.

Several subsets are split off from the complete Buckeye corpus. The most important are devpart1 and zs, which correspond to English1 and English2, respectively, in Kamper et al. (2016).

Install dependencies

Dependencies can be installed in a conda environment:

conda env create -f environment.yml
conda activate eskmeans

Install the ES-KMeans package:

mkdir ../src/
git clone ../src/eskmeans/

Extract speech features

Extract MFCCs in features/ as follows:

cd features/

More details on the feature file formats are given in features/

Unsupervised syllable boundary detection

As a preprocessing step, we constrain the allowed word boundary positions to boundaries detected by an unsupervised syllable boundary detection algorithm. We specifically use the algorithm described in:

O. J. Räsänen, G. Doyle, and M. C. Frank, "Pre-linguistic segmentation of speech into syllable-like units," Cognition, 2018.

Extract the syllable boundaries in syllables/ as follows:

cd syllables/
./ buckeye
./ xitsonga
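
To illustrate how detected syllable boundaries constrain the allowed word boundary positions, here is a minimal sketch that enumerates candidate word segments from a list of boundaries. The function name `candidate_segments`, the `max_syllables` limit and the example boundary values are assumptions made for illustration; they are not taken from this recipe or from the ES-KMeans package.

```python
from typing import List, Tuple


def candidate_segments(
    boundaries: List[int], max_syllables: int = 3
) -> List[Tuple[int, int]]:
    """Enumerate (start, end) frame pairs that span 1..max_syllables syllables.

    `boundaries` is a sorted list of frame indices of detected syllable
    boundaries, including the utterance start and end (hypothetical format).
    """
    segments = []
    last = len(boundaries) - 1
    for i, start in enumerate(boundaries[:-1]):
        # A candidate word may only end on a later detected boundary.
        for j in range(i + 1, min(i + max_syllables, last) + 1):
            segments.append((start, boundaries[j]))
    return segments


# Three syllables, with candidate words spanning at most two of them.
print(candidate_segments([0, 12, 30, 41], max_syllables=2))
```

Restricting candidates in this way keeps the number of segments that need to be embedded and scored small compared to allowing a word boundary at every frame.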

Downsampled acoustic word embeddings

Extract and evaluate downsampled acoustic word embeddings by running the steps in downsample/
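
As a rough illustration of what downsampling means here, the sketch below maps a variable-length sequence of feature frames to a fixed-dimensional vector by sampling a fixed number of points along the time axis and flattening the result. It assumes linear interpolation and 10 samples per segment; the function name `downsample` and these settings are illustrative assumptions and may differ from the scripts in downsample/.

```python
import numpy as np


def downsample(features: np.ndarray, n_samples: int = 10) -> np.ndarray:
    """Map a (n_frames, n_dims) segment to a vector of length n_samples * n_dims."""
    n_frames, n_dims = features.shape
    # Evenly spaced (possibly fractional) positions along the time axis.
    positions = np.linspace(0, n_frames - 1, n_samples)
    frames = np.arange(n_frames)
    # Linearly interpolate each feature dimension at the sampled positions.
    sampled = np.stack(
        [np.interp(positions, frames, features[:, d]) for d in range(n_dims)],
        axis=1,
    )
    # Flatten time-major so every segment yields the same dimensionality.
    return sampled.flatten()


# Segments of different durations map to embeddings of the same size.
short_segment = np.random.randn(20, 13)
long_segment = np.random.randn(75, 13)
print(downsample(short_segment).shape, downsample(long_segment).shape)
```

Because every segment ends up with the same dimensionality, embeddings of segments with different durations can be compared directly, for example with Euclidean or cosine distance.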

ES-KMeans: Segmentation and clustering

Segmentation and clustering are performed using the ES-KMeans package. Run the steps in segmentation/
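
The following sketch illustrates the kind of segmentation step involved: given fixed cluster means, dynamic programming selects the sequence of candidate segments with the lowest total cost, with boundaries restricted to the detected syllable boundaries. It is not the ES-KMeans package's API. The names `nearest_mean_cost` and `best_segmentation`, the duration-weighted squared-distance cost and the toy costs are assumptions for illustration. A full model would alternate a step like this with K-means updates of the cluster means; see the paper cited above.

```python
from typing import Callable, List, Sequence, Tuple


def nearest_mean_cost(
    embedding: Sequence[float], means: Sequence[Sequence[float]], duration: int
) -> float:
    """Duration-weighted squared distance to the closest cluster mean (assumed cost)."""
    return duration * min(
        sum((x - m) ** 2 for x, m in zip(embedding, mean)) for mean in means
    )


def best_segmentation(
    boundaries: List[int],
    cost: Callable[[int, int], float],
    max_syllables: int = 3,
) -> Tuple[float, List[Tuple[int, int]]]:
    """Find the lowest-cost segmentation whose cuts lie on `boundaries`."""
    n = len(boundaries)
    best = [float("inf")] * n  # best[t]: minimum cost of segmenting up to boundary t
    back = [0] * n             # back[t]: index of the boundary starting the last segment
    best[0] = 0.0
    for t in range(1, n):
        for s in range(max(0, t - max_syllables), t):
            total = best[s] + cost(boundaries[s], boundaries[t])
            if total < best[t]:
                best[t], back[t] = total, s
    # Trace the chosen segments back from the end of the utterance.
    segments = []
    t = n - 1
    while t > 0:
        segments.append((boundaries[back[t]], boundaries[t]))
        t = back[t]
    segments.reverse()
    return best[-1], segments


# Toy costs; in practice each cost would come from nearest_mean_cost applied
# to the embedding of the corresponding candidate segment.
toy_costs = {
    (0, 10): 5.0, (10, 20): 5.0, (20, 30): 1.0,
    (0, 20): 2.0, (10, 30): 9.0, (0, 30): 20.0,
}
total, segments = best_segmentation([0, 10, 20, 30], lambda s, e: toy_costs[(s, e)])
print(total, segments)
```

In this toy example the first two syllables are merged into one word-like segment because that candidate has a lower cost than keeping them separate.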



