# Comparison between Word2vec and FastText

A text can be analysed at several levels, among them the word, the sentence, and the full document. Modern NLP methods, unlike simple techniques such as keyword search, take these three levels into account when they perform a task such as topic detection.

Word2vec treats each word in the corpus as an atomic entity and generates one vector per word. In this sense Word2vec is very similar to GloVe: both treat the word as the smallest unit to train on.

FastText, which is essentially an extension of the word2vec model, treats each word as composed of character n-grams, so the vector for a word is the sum of the vectors of its character n-grams. For example, the vector for the word “apple” is the sum of the vectors of the n-grams:

**“<ap”, “app”, “ppl”, “ple”, “le>”, “<app”, “appl”, “pple”, “ple>”, “<appl”, “apple”, “pple>”, “<apple”, “apple>”**

(assuming the hyperparameters for the smallest n-gram length [minn] and the largest n-gram length [maxn] are 3 and 6, respectively).
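
To make the decomposition concrete, here is a minimal sketch that enumerates the character n-grams of a word after wrapping it in the `<` and `>` boundary markers. The helper name `char_ngrams` is made up for illustration and is not part of gensim or fastText; the real implementations additionally hash n-grams into a fixed number of buckets, which is left out here.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return all character n-grams of `word` with minn <= length <= maxn.

    The word is first wrapped in '<' and '>' so that prefixes and
    suffixes are distinguishable from n-grams inside the word.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        # slide a window of width n over the wrapped word
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


apple_grams = char_ngrams("apple")
print(len(apple_grams))  # 14
print(apple_grams)
```

With `minn=3` and `maxn=6` the wrapped string `<apple>` has seven characters, so it yields 5 + 4 + 3 + 2 = 14 n-grams.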

In [0]:
!pip install --quiet lxml

In [0]:
import numpy as np
import os
from random import shuffle
import re
import urllib.request
import zipfile
import lxml.etree
import logging


In [0]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

In [0]:
# download the data
urllib.request.urlretrieve("https://wit3.fbk.eu/get.php?path=XML_releases/xml/ted_en-20160408.zip&filename=ted_en-20160408.zip", filename="ted_en-20160408.zip")
# extract the subtitle text
with zipfile.ZipFile('ted_en-20160408.zip', 'r') as z:
    doc = lxml.etree.parse(z.open('ted_en-20160408.xml', 'r'))
input_text = '\n'.join(doc.xpath('//content/text()'))

In [0]:
# remove parenthesized text
input_text_noparens = re.sub(r'\([^)]*\)', '', input_text)
# store as list of sentences
sentences_strings_ted = []
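for line in input_text_noparens.split('\n'):
    # drop a leading speaker label (up to 20 characters followed by a colon), keep the rest
    m = re.match(r'^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$', line)
    # naive sentence split on full stops, skipping empty fragments
    sentences_strings_ted.extend(sent for sent in m.groupdict()['postcolon'].split('.') if sent)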
# store as list of lists of words
sentences_ted = []
for sent_str in sentences_strings_ted:
    tokens = re.sub(r"[^a-z0-9]+", " ", sent_str.lower()).split()
    sentences_ted.append(tokens)
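
To see what this preprocessing produces, the same three steps can be run on a tiny hand-written transcript. This is only a sketch: the `sample` string and the speaker name in it are invented for illustration and do not come from the TED data.

```python
import re

# hypothetical two-line transcript, made up for this example
sample = "Chris Anderson: Welcome back. (Applause) It's 2016.\nThank you."

# 1. remove parenthesized text
no_parens = re.sub(r'\([^)]*\)', '', sample)

# 2. strip short speaker labels and split each line into sentences
sentence_strings = []
for line in no_parens.split('\n'):
    m = re.match(r'^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$', line)
    sentence_strings.extend(s for s in m.group('postcolon').split('.') if s)

# 3. lowercase, replace anything that is not a-z or 0-9 by a space, tokenize
sample_tokens = [re.sub(r"[^a-z0-9]+", " ", s.lower()).split()
                 for s in sentence_strings]
print(sample_tokens)
# [['welcome', 'back'], ['it', 's', '2016'], ['thank', 'you']]
```

Note two side effects: every token is lowercased, and apostrophes split contractions, so "It's" becomes the two tokens `it` and `s`.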

In [10]:
from gensim.models import Word2Vec
# CBOW (sg=0) with 100-dimensional vectors; gensim >= 4.0 renamed `size` to `vector_size`
model_ted_wv = Word2Vec(sentences=sentences_ted, size=100, window=5, min_count=5, workers=4, sg=0)


2018-09-17 18:19:02,206 : INFO : 'pattern' package not found; tag filters are not available for English
2018-09-17 18:19:02,214 : INFO : collecting all words and their counts
2018-09-17 18:19:02,216 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-09-17 18:19:02,260 : INFO : PROGRESS: at sentence #10000, processed 172561 words, keeping 12186 word types
2018-09-17 18:19:02,305 : INFO : PROGRESS: at sentence #20000, processed 344873 words, keeping 17274 word types
2018-09-17 18:19:02,351 : INFO : PROGRESS: at sentence #30000, processed 516752 words, keeping 20770 word types
2018-09-17 18:19:02,402 : INFO : PROGRESS: at sentence #40000, processed 704349 words, keeping 23855 word types
2018-09-17 18:19:02,454 : INFO : PROGRESS: at sentence #50000, processed 889918 words, keeping 26831 word types
2018-09-17 18:19:02,509 : INFO : PROGRESS: at sentence #60000, processed 1073225 words, keeping 28991 word types
2018-09-17 18:19:02,559 : INFO : PROGRESS: at sentenc

In [11]:
model_ted_wv.wv.most_similar("man")

2018-09-17 18:19:35,260 : INFO : precomputing L2-norms of word weight vectors


[('woman', 0.8531621694564819),
 ('guy', 0.8057481050491333),
 ('soldier', 0.7574235796928406),
 ('lady', 0.7569026350975037),
 ('boy', 0.7550076246261597),
 ('girl', 0.7447366714477539),
 ('gentleman', 0.7282483577728271),
 ('poet', 0.7053318023681641),
 ('david', 0.698114812374115),
 ('kid', 0.669856607913971)]
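
`most_similar` ranks words by the cosine similarity between their vectors, which is why the log mentions precomputing L2-norms. Below is a minimal sketch of that ranking on hand-made two-dimensional vectors. The toy vectors and the helper names `cosine_similarity` and `rank_neighbours` are assumptions for illustration, not values or functions from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_neighbours(word, vectors):
    """Return (other_word, similarity) pairs, most similar first."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# made-up vectors, chosen so the result is easy to check by hand
toy_vectors = {"man": [1.0, 0.0], "woman": [0.8, 0.6], "car": [0.0, 1.0]}
toy_ranking = rank_neighbours("man", toy_vectors)
print(toy_ranking)
```

Here "woman" comes first with a similarity of 0.8, and "car", whose vector is orthogonal to "man", comes last with 0.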

In [13]:
model_ted_wv.wv.most_similar("Hitman")

KeyError: ignored
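
The `KeyError` is expected: Word2vec stores exactly one vector per vocabulary word, so a word that was not seen during training has no entry. The preprocessing above also lowercases every token, so a capitalised query such as "Hitman" cannot match the vocabulary. The sketch below contrasts a word-level lookup with a subword composition on a toy vocabulary. Everything in it (the random vectors, the three training words, the helper names, and the choice to average the known n-gram vectors) is an assumption for illustration, not the gensim implementation.

```python
import random

DIM = 4
rng = random.Random(0)  # fixed seed so the sketch is reproducible


def random_vector():
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def char_ngrams(word, minn=3, maxn=6):
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(wrapped) - n + 1)]


training_vocab = ["man", "batman", "stuntman"]  # hypothetical toy vocabulary

# word-level model: one vector per training word
word_table = {w: random_vector() for w in training_vocab}

# subword model: one vector per character n-gram seen in training
ngram_table = {}
for w in training_vocab:
    for gram in char_ngrams(w):
        ngram_table.setdefault(gram, random_vector())


def word_level_lookup(word):
    return word_table[word]  # raises KeyError for unseen words


def subword_lookup(word):
    known = [ngram_table[g] for g in char_ngrams(word) if g in ngram_table]
    if not known:
        raise KeyError(word)
    # average the vectors of the n-grams that were seen in training
    return [sum(component) / len(known) for component in zip(*known)]


try:
    word_level_lookup("hitman")
except KeyError:
    print("word-level lookup: 'hitman' is out of vocabulary")

print(subword_lookup("hitman"))  # built from n-grams shared with the training words
```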

In [12]:
from gensim.models import FastText
# note: skip-gram (sg=1) here, whereas the Word2Vec model above was trained with CBOW (sg=0)
model_ted_ft = FastText(sentences_ted, size=100, window=5, min_count=5, workers=4, sg=1)

2018-09-17 18:19:36,722 : INFO : collecting all words and their counts
2018-09-17 18:19:36,725 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-09-17 18:19:36,769 : INFO : PROGRESS: at sentence #10000, processed 172561 words, keeping 12186 word types
2018-09-17 18:19:36,810 : INFO : PROGRESS: at sentence #20000, processed 344873 words, keeping 17274 word types
2018-09-17 18:19:36,849 : INFO : PROGRESS: at sentence #30000, processed 516752 words, keeping 20770 word types
2018-09-17 18:19:36,895 : INFO : PROGRESS: at sentence #40000, processed 704349 words, keeping 23855 word types
2018-09-17 18:19:36,938 : INFO : PROGRESS: at sentence #50000, processed 889918 words, keeping 26831 word types
2018-09-17 18:19:36,983 : INFO : PROGRESS: at sentence #60000, processed 1073225 words, keeping 28991 word types
2018-09-17 18:19:37,024 : INFO : PROGRESS: at sentence #70000, processed 1249812 words, keeping 31170 word types
2018-09-17 18:19:37,065 : INFO : PROGRESS: a

In [17]:
model_ted_ft.wv.most_similar("man")




[('batman', 0.8135496377944946),
 ('woman', 0.7971199750900269),
 ('ekman', 0.7905158996582031),
 ('hoffman', 0.7802772521972656),
 ('shaman', 0.7646391987800598),
 ('foreman', 0.7645190954208374),
 ('stuntman', 0.7640399932861328),
 ('salman', 0.7622843384742737),
 ('wurman', 0.757695198059082),
 ('chapman', 0.7491511702537537)]

In [16]:
model_ted_ft.wv.most_similar("Hitman")


2018-09-17 18:25:09,505 : INFO : precomputing L2-norms of word weight vectors
2018-09-17 18:25:09,545 : INFO : precomputing L2-norms of ngram weight vectors


[('stuntman', 0.8959266543388367),
 ('craftsman', 0.8656678199768066),
 ('seligman', 0.8425419926643372),
 ('batman', 0.8360868096351624),
 ('pseudonym', 0.8333516120910645),
 ('gottman', 0.8259511590003967),
 ('gymnosophist', 0.8253354430198669),
 ('wurman', 0.822790265083313),
 ('pseudo', 0.8210907578468323),
 ('chandelier', 0.8204042911529541)]
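
Many of the neighbours returned for "Hitman" end in "man", which is consistent with the query vector being built from character n-grams it shares with those words. The following sketch lists the shared n-grams for a few of the neighbours above, again assuming n-gram lengths from 3 to 6. The helper `ngram_set` is made up for illustration. The overlap does not account for entries such as "chandelier", which shares no n-gram with the query.

```python
def ngram_set(word, minn=3, maxn=6):
    """Set of character n-grams of the boundary-wrapped word."""
    wrapped = f"<{word}>"
    return {wrapped[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(wrapped) - n + 1)}


query = "Hitman"
overlaps = {}
for neighbour in ["stuntman", "craftsman", "batman", "chandelier"]:
    overlaps[neighbour] = sorted(ngram_set(query) & ngram_set(neighbour))
    print(neighbour, overlaps[neighbour])
```

For "stuntman" the shared n-grams are `an>`, `man`, `man>`, `tma`, `tman` and `tman>`; for "chandelier" the list is empty.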