Training word embeddings using hierarchical softmax with a semantic tree.

In hierarchical softmax, the regular softmax layer is replaced by a binary tree: at each inner node of the tree, a binary classifier is trained to select the child node containing the center word given its context (see Morin & Bengio (2005) for details). By default, the implementations in word2vec and fastText use a Huffman tree built from word frequency. However, such a tree carries no semantic information, so training consistent classifiers at each node may be difficult.
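To make the mechanism concrete, here is a minimal, illustrative sketch of hierarchical softmax scoring (not the actual word2vec/fastText code): the probability of the center word is the product of binary decisions along its root-to-leaf path, where each inner node has its own classifier vector. The path encoding with ±1 directions is a hypothetical convention, mirroring the sign-flipped logit used in word2vec.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_probability(context_vec, path_nodes, path_directions, node_vectors):
    """P(word | context) as a product of binary decisions along the tree path.

    path_nodes: indices of the inner nodes on the root-to-leaf path.
    path_directions: +1 for "go left", -1 for "go right" at each node
    (hypothetical encoding; the sign flips the logit).
    """
    prob = 1.0
    for node, direction in zip(path_nodes, path_directions):
        prob *= sigmoid(direction * node_vectors[node] @ context_vec)
    return prob

# Toy example: a word whose leaf sits two inner nodes below the root.
rng = np.random.default_rng(0)
node_vectors = rng.normal(size=(2, 3))  # one classifier vector per inner node
context = rng.normal(size=3)            # averaged context vector (CBOW-style)
p = hs_probability(context, [0, 1], [+1, -1], node_vectors)
```

Because each step is a sigmoid, the product is always a valid probability in (0, 1), and the probabilities of all leaves under a node sum to that node's probability mass.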

Based on the work of Mnih and Hinton (2008), who showed that semantic trees can improve language models compared to random trees, we use GMM clustering of initial word vectors to derive a tree and feed it to fastText to replace the Huffman tree in a second round of training. Consistent with their results, this also improves performance on word analogy and word similarity tasks.
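A semantic tree of this kind can be derived by recursively splitting the vocabulary in two with a 2-component GMM. The sketch below uses scikit-learn's GaussianMixture as an assumption for the clustering step; the repository's actual clustering code lives in the notebook and may differ in its details (initialization, covariance type, handling of degenerate splits).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def build_tree(word_ids, vectors):
    """Recursively split word_ids into a nested (left, right) binary tree.

    Leaves are single word ids; inner nodes are 2-tuples. The GMM split is
    re-fit on each subset of word vectors.
    """
    if len(word_ids) <= 1:
        return word_ids[0]
    gmm = GaussianMixture(n_components=2, random_state=0).fit(vectors[word_ids])
    labels = gmm.predict(vectors[word_ids])
    left = [w for w, lab in zip(word_ids, labels) if lab == 0]
    right = [w for w, lab in zip(word_ids, labels) if lab == 1]
    if not left or not right:  # degenerate split: fall back to halving
        mid = len(word_ids) // 2
        left, right = word_ids[:mid], word_ids[mid:]
    return (build_tree(left, vectors), build_tree(right, vectors))

# Toy example: 8 words with random 4-dimensional "initial" vectors.
rng = np.random.default_rng(1)
vecs = rng.normal(size=(8, 4))
tree = build_tree(list(range(8)), vecs)
```

Each inner node of the resulting tuple tree corresponds to one binary classifier in the hierarchical softmax, so words that cluster together share most of their path and, hopefully, a more semantically coherent sequence of decisions.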


The repository contains the IPython notebook cbow_hs_huffman_vs_semantic_tree.ipynb to run the experiment. A modified version of fastText that reads the resulting tree file is provided here; it needs to be compiled, and the paths in the notebook adjusted to point to the fasttext executable and the training data file. The notebook also contains functions for clustering a set of word vectors into a binary tree and storing it in a format that the modified fastText version can read.

More details on the experiment and results can be found in this blog post.


Morin, F., & Bengio, Y. (2005). Hierarchical Probabilistic Neural Network Language Model. AISTATS, 5.

Mnih, A., & Hinton, G. E. (2008). A Scalable Hierarchical Distributed Language Model. Advances in Neural Information Processing Systems, 1–8.