# Comparison of FastText and Word2Vec 

Facebook Research open sourced a great project yesterday - [fastText](https://github.com/facebookresearch/fastText), a fast (no surprise) and effective method to learn word representations and perform text classification. I was curious about comparing these embeddings to other commonly used embeddings, so word2vec seemed like the obvious choice, especially considering fastText embeddings are an extension of word2vec. 

I've used gensim to train the word2vec models, and the analogical reasoning task (described in Section 4.1 of [[2]](https://arxiv.org/pdf/1301.3781v3.pdf)) for comparing the word2vec and fastText models. I've compared embeddings trained using the skipgram architecture.
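
For readers unfamiliar with the analogy task, here is a minimal sketch of how a questions file such as `questions-words.txt` is laid out: a line starting with `:` opens a section, and every other line holds four words `a b c d`, read as "a is to b as c is to d". The sample text, the helper names `parse_questions` and `question_kind`, and the rule that sections prefixed with `gram` are the syntactic ones are illustrative assumptions, not part of gensim.

```python
# Tiny illustrative sample in the same layout as questions-words.txt
SAMPLE = """\
: capital-common-countries
Athens Greece Baghdad Iraq
: family
boy girl brother sister
: gram1-adjective-to-adverb
amazing amazingly calm calmly
"""


def parse_questions(text):
    """Group analogy questions (4-tuples of words) by section name."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(':'):
            # A new section header, e.g. ": family"
            current = line[1:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(tuple(line.split()))
    return sections


def question_kind(section_name):
    """Assumption: sections named gram* are syntactic, the rest semantic."""
    return 'syntactic' if section_name.startswith('gram') else 'semantic'


sections = parse_questions(SAMPLE)
for name, questions in sections.items():
    print(name, question_kind(name), questions)
```

This mirrors the split used later in the comparison, where the first five sections are counted as semantic and the remaining ones as syntactic.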

# Download data

In [None]:
import nltk
nltk.download('brown')
# Only the brown corpus is needed, in case you don't have it.
# Alternatively, you can simply download the pretrained models below if you wish to avoid downloading and training

# Generate brown corpus text file
with open('brown_corp.txt', 'w+') as f:
    for word in nltk.corpus.brown.words():
        f.write('{word} '.format(word=word))

In [None]:
# download the text8 corpus (a 100 MB sample of cleaned wikipedia text)
# alternatively, you can simply download the pretrained models below if you wish to avoid downloading and training
!wget http://mattmahoney.net/dc/text8.zip
# unzip the archive to get the plain `text8` file that the training commands below expect
!unzip text8.zip

In [None]:
# download the file questions-words.txt to be used for comparing word embeddings
!wget https://raw.githubusercontent.com/arfon/word2vec/master/questions-words.txt

# Train models

If you wish to avoid training, you can download pre-trained models instead in the next section.
For training the fastText models yourself, you'll have to follow the setup instructions for [fastText](https://github.com/facebookresearch/fastText) and run the training with -

In [21]:
%%time
# Make sure you set $FT_HOME to your fastText directory root
# Training fastText skipgram model on brown corpus
!$FT_HOME/fasttext skipgram -input brown_corp.txt -output brown_ft  -lr 0.05 -dim 100 -ws 5 -epoch 5 -minCount 5 -neg 5 -loss ns -t 0.0001

Read 1M words
Progress: 100.0%  words/sec/thread: 31519  lr: 0.000001  loss: 2.289203  eta: 0h0m 
Train time: 50.000000 sec
CPU times: user 1.53 s, sys: 164 ms, total: 1.69 s
Wall time: 1min 11s


In [15]:
%%time
# Training fastText skipgram model on text8 corpus
!$FT_HOME/fasttext skipgram -input text8 -output text8_ft -lr 0.05 -dim 100 -ws 5 -epoch 5 -minCount 5 -neg 5 -loss ns -t 0.0001

Read 17M words
Progress: 100.0%  words/sec/thread: 20536  lr: 0.000001  loss: 1.830005  eta: 0h0m 
Train time: 1257.000000 sec
CPU times: user 29.5 s, sys: 3.75 s, total: 33.3 s
Wall time: 21min 38s


For training the gensim models -

In [2]:
from nltk.corpus import brown
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
logging.root.setLevel(level=logging.INFO)

# Make sure you create a models dir in case it doesn't exist, or modify MODELS_DIR
MODELS_DIR = 'models/'

# Same values as used for fastText training above
params = {
    'alpha': 0.05,
    'size': 100,
    'window': 5,
    'iter': 5,
    'min_count': 5,
    'sample': 1e-4,
    'sg': 1,
    'hs': 0,
    'negative': 5
}

print("Training word2vec on brown corpus..")
%time brown_gs = Word2Vec(brown.sents(), **params)
brown_gs.save_word2vec_format(MODELS_DIR + 'brown_gs.vec')

print("Training word2vec on text8 corpus..")
%time text8_gs = Word2Vec(Text8Corpus('text8'), **params)
text8_gs.save_word2vec_format(MODELS_DIR + 'text8_gs.vec')

Training word2vec on brown corpus..
CPU times: user 1min 11s, sys: 648 ms, total: 1min 12s
Wall time: 43 s
Training word2vec on text8 corpus..
CPU times: user 21min 18s, sys: 3.08 s, total: 21min 21s
Wall time: 8min


# Download models
In case you wish to avoid downloading the corpus and training the models, you can download pretrained models with - 

In [None]:
# download the fastText and gensim models trained on the brown corpus and text8 corpus
!wget https://www.dropbox.com/s/d15f3eumu3i8ld6/models.tar.gz?dl=1 -O models.tar.gz

Once you have downloaded or trained the models (make sure they're in the `models/` directory, or that you've appropriately changed `MODELS_DIR`) and downloaded `questions-words.txt`, you're ready to run the comparison.

# Comparisons

In [3]:
import logging
from gensim.models import Word2Vec

logging.root.handlers = []
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

def print_accuracy(model, questions_file):
    print('Evaluating...\n')
    acc = model.accuracy(questions_file)

    sem_correct = sum((len(acc[i]['correct']) for i in range(5)))
    sem_total = sum((len(acc[i]['correct']) + len(acc[i]['incorrect'])) for i in range(5))
    print('\nSemantic: {:d}/{:d}, Accuracy: {:.2f}%'.format(sem_correct, sem_total, 100*float(sem_correct)/sem_total))
    
    syn_correct = sum((len(acc[i]['correct']) for i in range(5, len(acc)-1)))
    syn_total = sum((len(acc[i]['correct']) + len(acc[i]['incorrect'])) for i in range(5,len(acc)-1))
    print('Syntactic: {:d}/{:d}, Accuracy: {:.2f}%\n'.format(syn_correct, syn_total, 100*float(syn_correct)/syn_total))

MODELS_DIR = 'models/'

word_analogies_file = 'questions-words.txt'
print('\nLoading FastText embeddings')
brown_ft = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_ft.vec')
print('Accuracy for FastText:')
print_accuracy(brown_ft, word_analogies_file)

print('\nLoading Gensim embeddings')
brown_gs = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_gs.vec')
print('Accuracy for Word2Vec:')
print_accuracy(brown_gs, word_analogies_file)

2016-08-08 20:12:12,312 : INFO : loading projection weights from models/brown_ft.vec



Loading FastText embeddings


2016-08-08 20:12:14,070 : INFO : loaded (15173, 100) matrix from models/brown_ft.vec
2016-08-08 20:12:14,198 : INFO : precomputing L2-norms of word weight vectors


Accuracy for FastText:
Evaluating...



2016-08-08 20:12:14,936 : INFO : family: 14.8% (27/182)
2016-08-08 20:12:19,156 : INFO : gram1-adjective-to-adverb: 73.5% (516/702)
2016-08-08 20:12:19,716 : INFO : gram2-opposite: 81.8% (108/132)
2016-08-08 20:12:27,212 : INFO : gram3-comparative: 61.5% (649/1056)
2016-08-08 20:12:28,835 : INFO : gram4-superlative: 68.6% (144/210)
2016-08-08 20:12:32,580 : INFO : gram5-present-participle: 67.1% (436/650)
2016-08-08 20:12:38,908 : INFO : gram7-past-tense: 11.5% (145/1260)
2016-08-08 20:12:40,975 : INFO : gram8-plural: 60.5% (334/552)
2016-08-08 20:12:42,718 : INFO : gram9-plural-verbs: 71.1% (243/342)
2016-08-08 20:12:42,729 : INFO : total: 51.2% (2602/5086)
2016-08-08 20:12:42,756 : INFO : loading projection weights from models/brown_gs.vec



Semantic: 27/182, Accuracy: 14.84%
Syntactic: 2575/4904, Accuracy: 52.51%


Loading Gensim embeddings


2016-08-08 20:12:45,270 : INFO : loaded (15173, 100) matrix from models/brown_gs.vec
2016-08-08 20:12:45,410 : INFO : precomputing L2-norms of word weight vectors


Accuracy for Word2Vec:
Evaluating...



2016-08-08 20:12:45,907 : INFO : family: 24.2% (44/182)
2016-08-08 20:12:50,313 : INFO : gram1-adjective-to-adverb: 0.6% (4/702)
2016-08-08 20:12:51,338 : INFO : gram2-opposite: 0.8% (1/132)
2016-08-08 20:12:56,223 : INFO : gram3-comparative: 3.5% (37/1056)
2016-08-08 20:12:57,968 : INFO : gram4-superlative: 1.4% (3/210)
2016-08-08 20:13:03,522 : INFO : gram5-present-participle: 0.8% (5/650)
2016-08-08 20:13:10,912 : INFO : gram7-past-tense: 1.4% (18/1260)
2016-08-08 20:13:13,115 : INFO : gram8-plural: 5.1% (28/552)
2016-08-08 20:13:14,740 : INFO : gram9-plural-verbs: 1.5% (5/342)
2016-08-08 20:13:14,741 : INFO : total: 2.9% (145/5086)



Semantic: 44/182, Accuracy: 24.18%
Syntactic: 101/4904, Accuracy: 2.06%



Word2Vec embeddings seem to be slightly better than fastText embeddings at the semantic tasks, while the fastText embeddings do significantly better on the syntactic analogies. Makes sense, since fastText embeddings are trained for understanding morphological nuances, and most of the syntactic analogies are morphology based. 

Let me explain that better.

According to the paper [[1]](https://arxiv.org/abs/1607.04606), embeddings for words are represented by the sum of their n-gram embeddings. This is meant to be useful for morphologically rich languages - so theoretically, the embedding for `apparently` would include information from both character n-grams `apparent` and `ly` (as well as other n-grams), and the n-grams would combine in a simple, linear manner. This is very similar to what most of our syntactic tasks look like.
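
To make the n-gram idea concrete, here is a minimal sketch of extracting character n-grams from a word. It assumes the scheme described in the paper: the word is wrapped in `<` and `>` boundary markers and all n-grams of length 3 to 6 are taken. The function name `char_ngrams` and its parameters are hypothetical, and the sketch ignores details of the real implementation such as hashing n-grams into buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of `word`, including boundary markers."""
    wrapped = '<' + word + '>'
    ngrams = set()
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.add(wrapped[start:start + n])
    return ngrams


apparent = char_ngrams('apparent')
apparently = char_ngrams('apparently')

# N-grams shared with the stem carry the "apparent" part of the meaning
print(sorted(apparent & apparently))
# N-grams found only in the adverb include the suffix n-gram 'ly>'
print(sorted(apparently - apparent))
```

Under these assumptions, `apparently` shares most of its n-grams with `apparent`, and the suffix n-gram `ly>` is also shared with other adverbs such as `calmly`.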

Example analogy:

`amazing amazingly calm calmly`

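This analogy is marked correct if `calmly` is the word whose embedding is closest to `embedding(amazingly)` - `embedding(amazing)` + `embedding(calm)`, which holds roughly when:

`embedding(amazing)` - `embedding(amazingly)` = `embedding(calm)` - `embedding(calmly)`

Both these subtractions would result in a very similar set of remaining n-grams.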
No surprise the fastText embeddings do extremely well on this.
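
As a rough illustration of how such an analogy question is scored, the sketch below finds the word closest (by cosine similarity) to `embedding(amazingly) - embedding(amazing) + embedding(calm)`, excluding the three query words. The 3-dimensional vectors are made up for the example, and `solve_analogy` is a hypothetical helper; it is a simplification of what an evaluation routine does, not gensim's actual code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


# Made-up toy embeddings: the third dimension plays the role of the "-ly" suffix
toy_vectors = {
    'amazing':   [1.0, 0.0, 0.0],
    'amazingly': [1.0, 0.0, 1.0],
    'calm':      [0.0, 1.0, 0.0],
    'calmly':    [0.0, 1.0, 1.0],
    'quiet':     [0.1, 0.9, 0.0],
}

predicted = solve_analogy(toy_vectors, 'amazing', 'amazingly', 'calm')
print(predicted)  # calmly
```

In this toy setup the suffix contributes the same offset to both adverbs, which is the linear behaviour that summed n-gram embeddings would encourage.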

Let's do a small test to validate this hypothesis. fastText differs from word2vec only in that it uses char n-gram embeddings as well as the actual word embedding in the scoring function to calculate scores and then likelihoods for each word, given a context word. If char n-gram embeddings are not present, this reduces (at least theoretically) to the original word2vec model. This can be implemented by setting the max length of char n-grams to 0 for fastText (the `-maxn 0` flag in the commands below).


In [10]:
%%time
# Training fastText skipgram model on brown corpus without n-grams
# If you chose to download the models, this model will already be present in the MODELS_DIR directory
!$FT_HOME/fasttext skipgram -input brown_corp.txt -output brown_ft_no_ng -lr 0.05 -dim 100 -ws 5 -epoch 5 -minCount 5 -neg 5 -loss ns -t 0.0001 -maxn 0

Read 1M words
Progress: 100.0%  words/sec/thread: 55755  lr: 0.000001  loss: 2.356848  eta: 0h0m 
Train time: 31.000000 sec
CPU times: user 1.32 s, sys: 240 ms, total: 1.56 s
Wall time: 57.7 s


In [11]:
%%time
# Training fastText skipgram model on text8 corpus without n-grams
# If you chose to download the models, this model will already be present in the MODELS_DIR directory
!$FT_HOME/fasttext skipgram -input text8 -output text8_ft_no_ng -lr 0.05 -dim 100 -ws 5 -epoch 5 -minCount 5 -neg 5 -loss ns -t 0.0001 -maxn 0

Read 17M words
Progress: 100.0%  words/sec/thread: 49050  lr: 0.000001  loss: 1.879224  eta: 0h0m 
Train time: 514.000000 sec
CPU times: user 13.2 s, sys: 1.89 s, total: 15.1 s
Wall time: 9min 16s


In [4]:
print('Loading FastText embeddings')
brown_ft_no_ng = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_ft_no_ng.vec')
print('Accuracy for FastText (without n-grams):')
print_accuracy(brown_ft_no_ng, word_analogies_file)

print('Accuracy for Word2Vec:')
print_accuracy(brown_gs, word_analogies_file)

print('Accuracy for FastText (with n-grams):')
print_accuracy(brown_ft, word_analogies_file)


2016-08-08 20:13:27,092 : INFO : loading projection weights from models/brown_ft_no_ng.vec


Loading FastText embeddings


2016-08-08 20:13:28,960 : INFO : loaded (15173, 100) matrix from models/brown_ft_no_ng.vec
2016-08-08 20:13:29,109 : INFO : precomputing L2-norms of word weight vectors


Accuracy for FastText (without n-grams):
Evaluating...



2016-08-08 20:13:30,072 : INFO : family: 17.6% (32/182)
2016-08-08 20:13:33,474 : INFO : gram1-adjective-to-adverb: 0.1% (1/702)
2016-08-08 20:13:34,237 : INFO : gram2-opposite: 0.0% (0/132)
2016-08-08 20:13:40,814 : INFO : gram3-comparative: 2.8% (30/1056)
2016-08-08 20:13:41,963 : INFO : gram4-superlative: 0.5% (1/210)
2016-08-08 20:13:45,717 : INFO : gram5-present-participle: 1.4% (9/650)
2016-08-08 20:13:50,518 : INFO : gram7-past-tense: 1.0% (12/1260)
2016-08-08 20:13:54,723 : INFO : gram8-plural: 6.3% (35/552)
2016-08-08 20:13:55,793 : INFO : gram9-plural-verbs: 1.2% (4/342)
2016-08-08 20:13:55,804 : INFO : total: 2.4% (124/5086)



Semantic: 32/182, Accuracy: 17.58%
Syntactic: 92/4904, Accuracy: 1.88%

Accuracy for Word2Vec:
Evaluating...



2016-08-08 20:13:56,576 : INFO : family: 24.2% (44/182)
2016-08-08 20:14:02,202 : INFO : gram1-adjective-to-adverb: 0.6% (4/702)
2016-08-08 20:14:03,355 : INFO : gram2-opposite: 0.8% (1/132)
2016-08-08 20:14:09,065 : INFO : gram3-comparative: 3.5% (37/1056)
2016-08-08 20:14:10,456 : INFO : gram4-superlative: 1.4% (3/210)
2016-08-08 20:14:12,750 : INFO : gram5-present-participle: 0.8% (5/650)
2016-08-08 20:14:20,905 : INFO : gram7-past-tense: 1.4% (18/1260)
2016-08-08 20:14:23,880 : INFO : gram8-plural: 5.1% (28/552)
2016-08-08 20:14:26,089 : INFO : gram9-plural-verbs: 1.5% (5/342)
2016-08-08 20:14:26,092 : INFO : total: 2.9% (145/5086)



Semantic: 44/182, Accuracy: 24.18%
Syntactic: 101/4904, Accuracy: 2.06%

Accuracy for FastText (with n-grams):
Evaluating...



2016-08-08 20:14:27,832 : INFO : family: 14.8% (27/182)
2016-08-08 20:14:32,041 : INFO : gram1-adjective-to-adverb: 73.5% (516/702)
2016-08-08 20:14:33,387 : INFO : gram2-opposite: 81.8% (108/132)
2016-08-08 20:14:39,900 : INFO : gram3-comparative: 61.5% (649/1056)
2016-08-08 20:14:41,677 : INFO : gram4-superlative: 68.6% (144/210)
2016-08-08 20:14:45,600 : INFO : gram5-present-participle: 67.1% (436/650)
2016-08-08 20:14:52,206 : INFO : gram7-past-tense: 11.5% (145/1260)
2016-08-08 20:14:54,469 : INFO : gram8-plural: 60.5% (334/552)
2016-08-08 20:14:57,125 : INFO : gram9-plural-verbs: 71.1% (243/342)
2016-08-08 20:14:57,131 : INFO : total: 51.2% (2602/5086)



Semantic: 27/182, Accuracy: 14.84%
Syntactic: 2575/4904, Accuracy: 52.51%



A-ha! The results for FastText with no n-grams and Word2Vec look a lot more similar (as they should) - the differences could easily result from differences in implementation between fastText and Gensim, and randomization. Especially telling is that the semantic accuracy for FastText has more or less remained the same after removing n-grams, while the syntactic accuracy has taken a giant dive. Our hypothesis that the char n-grams result in better performance on syntactic analogies seems fair.

Let's try with a larger corpus now - text8 (collection of wiki articles). I'm also curious about the impact on semantic accuracy - for models trained on the brown corpus, the difference in the semantic accuracy and the accuracy values themselves are too small to be conclusive. Hopefully a larger corpus helps, and the text8 corpus likely has a lot more information about capitals, currencies, cities etc, which should be relevant to the semantic tasks.

In [5]:
print('Loading FastText embeddings')
text8_ft_no_ng = Word2Vec.load_word2vec_format(MODELS_DIR + 'text8_ft_no_ng.vec')
print('Accuracy for FastText (without n-grams):')
print_accuracy(text8_ft_no_ng, word_analogies_file)

print('Loading Gensim embeddings')
text8_gs = Word2Vec.load_word2vec_format(MODELS_DIR + 'text8_gs.vec')
print('Accuracy for word2vec:')
print_accuracy(text8_gs, word_analogies_file)

print('Loading FastText embeddings (with n-grams)')
text8_ft = Word2Vec.load_word2vec_format(MODELS_DIR + 'text8_ft.vec')
print('Accuracy for FastText (with n-grams):')
print_accuracy(text8_ft, word_analogies_file)

2016-08-08 20:15:04,545 : INFO : loading projection weights from models/text8_ft_no_ng.vec


Loading FastText embeddings


2016-08-08 20:15:17,475 : INFO : loaded (71290, 100) matrix from models/text8_ft_no_ng.vec


Accuracy for FastText (without n-grams):
Evaluating...



2016-08-08 20:15:17,791 : INFO : precomputing L2-norms of word weight vectors
2016-08-08 20:15:26,290 : INFO : capital-common-countries: 71.5% (362/506)
2016-08-08 20:15:57,625 : INFO : capital-world: 48.6% (706/1452)
2016-08-08 20:16:00,808 : INFO : currency: 22.0% (59/268)
2016-08-08 20:16:23,455 : INFO : city-in-state: 23.7% (358/1511)
2016-08-08 20:16:28,200 : INFO : family: 63.7% (195/306)
2016-08-08 20:16:40,358 : INFO : gram1-adjective-to-adverb: 14.4% (109/756)
2016-08-08 20:16:44,080 : INFO : gram2-opposite: 14.4% (44/306)
2016-08-08 20:16:58,768 : INFO : gram3-comparative: 41.7% (526/1260)
2016-08-08 20:17:05,195 : INFO : gram4-superlative: 27.7% (140/506)
2016-08-08 20:17:18,366 : INFO : gram5-present-participle: 24.0% (238/992)
2016-08-08 20:17:46,193 : INFO : gram6-nationality-adjective: 78.8% (1080/1371)
2016-08-08 20:18:10,738 : INFO : gram7-past-tense: 35.1% (467/1332)
2016-08-08 20:18:31,779 : INFO : gram8-plural: 52.9% (525/992)
2016-08-08 20:18:42,238 : INFO : gram9-


Semantic: 1680/4043, Accuracy: 41.55%
Syntactic: 3301/8165, Accuracy: 40.43%

Loading Gensim embeddings


2016-08-08 20:18:57,261 : INFO : loaded (71290, 100) matrix from models/text8_gs.vec
2016-08-08 20:18:57,575 : INFO : precomputing L2-norms of word weight vectors


Accuracy for word2vec:
Evaluating...



2016-08-08 20:19:03,098 : INFO : capital-common-countries: 68.0% (344/506)
2016-08-08 20:19:28,913 : INFO : capital-world: 43.3% (628/1452)
2016-08-08 20:19:32,656 : INFO : currency: 18.7% (50/268)
2016-08-08 20:19:53,161 : INFO : city-in-state: 23.9% (376/1571)
2016-08-08 20:19:57,613 : INFO : family: 62.1% (190/306)
2016-08-08 20:20:11,106 : INFO : gram1-adjective-to-adverb: 16.3% (123/756)
2016-08-08 20:20:14,284 : INFO : gram2-opposite: 14.4% (44/306)
2016-08-08 20:20:31,385 : INFO : gram3-comparative: 46.3% (584/1260)
2016-08-08 20:20:37,605 : INFO : gram4-superlative: 26.5% (134/506)
2016-08-08 20:20:51,307 : INFO : gram5-present-participle: 24.1% (239/992)
2016-08-08 20:21:12,579 : INFO : gram6-nationality-adjective: 77.6% (1064/1371)
2016-08-08 20:21:29,859 : INFO : gram7-past-tense: 30.9% (411/1332)
2016-08-08 20:21:43,074 : INFO : gram8-plural: 45.2% (448/992)
2016-08-08 20:21:54,261 : INFO : gram9-plural-verbs: 28.8% (187/650)
2016-08-08 20:21:54,265 : INFO : total: 39.3% (4


Semantic: 1588/4103, Accuracy: 38.70%
Syntactic: 3234/8165, Accuracy: 39.61%

Loading FastText embeddings (with n-grams)


2016-08-08 20:22:12,745 : INFO : loaded (71290, 100) matrix from models/text8_ft.vec
2016-08-08 20:22:12,910 : INFO : precomputing L2-norms of word weight vectors


Accuracy for FastText (with n-grams):
Evaluating...



2016-08-08 20:22:20,629 : INFO : capital-common-countries: 57.5% (291/506)
2016-08-08 20:22:59,687 : INFO : capital-world: 42.2% (613/1452)
2016-08-08 20:23:08,156 : INFO : currency: 11.9% (32/268)
2016-08-08 20:23:44,079 : INFO : city-in-state: 18.3% (277/1511)
2016-08-08 20:23:49,602 : INFO : family: 51.6% (158/306)
2016-08-08 20:24:03,708 : INFO : gram1-adjective-to-adverb: 74.5% (563/756)
2016-08-08 20:24:10,048 : INFO : gram2-opposite: 59.8% (183/306)
2016-08-08 20:24:35,713 : INFO : gram3-comparative: 68.7% (865/1260)
2016-08-08 20:24:43,200 : INFO : gram4-superlative: 53.2% (269/506)
2016-08-08 20:25:01,443 : INFO : gram5-present-participle: 57.1% (566/992)
2016-08-08 20:25:21,921 : INFO : gram6-nationality-adjective: 94.7% (1299/1371)
2016-08-08 20:25:41,048 : INFO : gram7-past-tense: 36.6% (487/1332)
2016-08-08 20:25:58,148 : INFO : gram8-plural: 91.3% (906/992)
2016-08-08 20:26:09,872 : INFO : gram9-plural-verbs: 56.6% (368/650)
2016-08-08 20:26:09,887 : INFO : total: 56.3% (


Semantic: 1371/4043, Accuracy: 33.91%
Syntactic: 5506/8165, Accuracy: 67.43%



With the text8 corpus, we observe a similar pattern. Semantic accuracies for all three models are in the same range, while FastText with n-grams performs far better on the syntactic analogies.
The semantic accuracy for all models increases significantly with the increase in corpus size. However, the increase in syntactic accuracy from the larger corpus is smaller (in both relative and absolute terms) for the n-gram FastText model than for the other two models. This could possibly indicate that advantages gained by incorporating morphological information could be less significant in case of larger corpus sizes (the corpora used in the original paper seem to indicate this too).
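
The gains in syntactic accuracy can be checked with a little arithmetic. The sketch below copies the syntactic accuracies reported in the outputs above (brown corpus, then text8) and computes the absolute gain in percentage points and the relative gain as a ratio; the helper name `gains` is just for illustration.

```python
# Syntactic accuracy (%) copied from the outputs above: (brown, text8)
syntactic = {
    'fastText (n-grams)': (52.51, 67.43),
    'fastText (no n-grams)': (1.88, 40.43),
    'word2vec (gensim)': (2.06, 39.61),
}


def gains(before, after):
    """Return (absolute gain in points, relative gain as a ratio)."""
    return after - before, after / before


for name, (brown_acc, text8_acc) in syntactic.items():
    absolute, relative = gains(brown_acc, text8_acc)
    print('{}: +{:.2f} points, {:.1f}x'.format(name, absolute, relative))
```

The n-gram model gains about 15 points, while the two models without n-grams gain close to 40 points each, starting from a much lower base.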

# Conclusions

These preliminary results seem to indicate that fastText embeddings are significantly better than word2vec at encoding syntactic information, while showing no clear advantage on semantic analogies (in these runs the n-gram fastText model scored somewhat lower than word2vec on the semantic tasks). The syntactic gain is expected, since most syntactic analogies are morphology based, and the char n-gram approach of fastText takes such information into account. It'd be interesting to see how transferable these embeddings are by comparing their performance in a downstream supervised task.

# References

[1] [Enriching Word Vectors with Subword Information](https://arxiv.org/pdf/1607.04606v1.pdf)

[2] [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781v3.pdf)