Unsupervised Acoustic Word Embeddings on Buckeye English and NCHLT Xitsonga


Note: This is an updated version of an earlier recipe. The code here uses Python 3 (instead of Python 2.7) and LibROSA for feature extraction (instead of HTK). Because the resulting features differ slightly, the results here do not exactly match those in the paper below, which used the older recipe.

Unsupervised acoustic word embedding (AWE) approaches are implemented and evaluated on the Buckeye English and NCHLT Xitsonga speech datasets. The experiments are described in:

  • H. Kamper, "Truly unsupervised acoustic word embeddings using weak top-down constraints in encoder-decoder models," in Proc. ICASSP, 2019. [arXiv]

Please cite this paper if you use the code.


The code provided here is not pretty. But I believe that research should be reproducible. I provide no guarantees with the code, but please let me know if you have any problems, find bugs or have general comments.

Download datasets

The complete Buckeye English corpus and a portion of the NCHLT Xitsonga corpus are used. These can be downloaded from:

From the complete Buckeye corpus we split off several subsets: the sets labelled devpart1 and zs correspond to the English1 and English2 sets in Kamper et al., 2016, respectively. For Xitsonga, we use the dataset provided as part of the Zero Speech Challenge 2015 (a subset of the NCHLT data).

Create and run Docker image

This recipe provides a Docker image containing all the required dependencies. The recipe can be run without Docker, but then the dependencies need to be installed separately (see below). To use the Docker image, you need to:

To build the Docker image, run:

cd docker
docker build -f Dockerfile.gpu -t py3_tf1.13 .
cd ..

The remaining steps in this recipe can be run in a container in interactive mode. The dataset directories will also need to be mounted. To run a container in interactive mode with the mounted directories, run:

docker run --runtime=nvidia -it --rm -u $(id -u):$(id -g) -p 8887:8887 \
    -v /r2d2/backup/endgame/datasets/buckeye:/data/buckeye \
    -v /r2d2/backup/endgame/datasets/zrsc2015/xitsonga_wavs:/data/xitsonga_wavs \
    -v "$(pwd)":/home \
    py3_tf1.13

Alternatively, run ./, which executes the above command and starts an interactive container.

To directly start a Jupyter notebook in a container, run ./ and open http://localhost:8889/.

If not using Docker: Install dependencies

If you are not using Docker, install the following dependencies:

To install speech_dtw, clone the required GitHub repositories into ../src/ and compile the code as follows:

mkdir ../src/  # not necessary when using Docker
git clone ../src/speech_dtw/
cd ../src/speech_dtw
make test
cd -

Extract speech features

Update the paths in to point to the datasets. If you are using docker, will already point to the mounted directories. Extract MFCC and filterbank features in the features/ directory as follows:

cd features

More details on the feature file formats are given in features/

Evaluate frame-level features using the same-different task

This is optional. To perform frame-level same-different evaluation based on dynamic time warping (DTW), follow samediff/
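As a rough illustration of what a DTW-based comparison involves, the sketch below computes a length-normalised alignment cost between two word segments, each given as a list of feature frames. It is a minimal pure-Python sketch and not the speech_dtw implementation: the function names `cosine_distance` and `dtw_cost`, the cosine frame distance, and the normalisation by the summed segment lengths are all assumptions made for the example.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two frames (equal-length lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: treat an all-zero frame as maximally dissimilar.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def dtw_cost(x, y):
    """Length-normalised DTW alignment cost between frame sequences x and y."""
    n, m = len(x), len(y)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    inf = float("inf")
    # acc[i][j] is the minimum accumulated cost of aligning x[:i] with y[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(x[i - 1], y[j - 1])
            # Allowed steps: advance in x only, in y only, or in both.
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    # Hypothetical normalisation so that long and short pairs are comparable.
    return acc[n][m] / (n + m)


if __name__ == "__main__":
    segment_a = [[1.0, 0.0], [0.0, 1.0]]
    segment_b = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # same content, slower
    print(dtw_cost(segment_a, segment_b))
```

In a same-different evaluation, a cost like this would be computed for every pair of word segments, and pairs would then be ranked by cost to see how well same-word pairs separate from different-word pairs.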

Obtain downsampled acoustic word embeddings

Extract and evaluate downsampled acoustic word embeddings by running the steps in downsample/
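The idea behind downsampling can be sketched as follows: a variable-length sequence of feature frames is reduced to a fixed number of frames, which are then flattened into a single fixed-dimensional vector. The example below is a minimal sketch under stated assumptions and not the code in downsample/: the function name `downsample`, the default of 10 samples, and the nearest-frame selection (rather than interpolation) are illustrative choices.

```python
def downsample(frames, n_samples=10):
    """Map a variable-length list of frames to one fixed-length vector.

    Picks n_samples frames at evenly spaced positions between the first and
    last frame, then concatenates them. The output length is always
    n_samples * len(frames[0]), regardless of the segment duration.
    """
    if not frames:
        raise ValueError("need at least one frame")
    if n_samples < 1:
        raise ValueError("n_samples must be positive")
    last = len(frames) - 1
    embedding = []
    for k in range(n_samples):
        # Evenly spaced position in [0, last], rounded to the nearest frame.
        idx = round(k * last / (n_samples - 1)) if n_samples > 1 else 0
        embedding.extend(frames[idx])
    return embedding


if __name__ == "__main__":
    # Five hypothetical 2-dimensional frames standing in for MFCCs.
    frames = [[float(i), 10.0 * i] for i in range(5)]
    print(downsample(frames, n_samples=3))
```

Because every segment maps to a vector of the same dimensionality, two words can be compared with a single vector distance instead of a full DTW alignment.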

Train neural acoustic word embeddings

Train and evaluate neural network acoustic word embedding models by running the steps in embeddings/


Some notebooks used during development are given in the notebooks/ directory. These were used mainly for debugging and exploration, so they are not polished. A Docker container can be used to launch a notebook session by running ./ and then opening http://localhost:8889/.

Unit tests

In the root project directory, run make test to run unit tests.


The code is distributed under the Creative Commons Attribution-ShareAlike license (CC BY-SA 4.0).

