In [1]:
%matplotlib inline


# FastText Model

Introduces Gensim's fastText model and demonstrates its use on the Lee Corpus.


In [2]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Here, we'll learn to work with the fastText library: training word-embedding
models, saving and loading them, and performing similarity operations and vector
lookups analogous to Word2Vec.



## When to use fastText?

The main principle behind [fastText](https://github.com/facebookresearch/fastText) is that the
morphological structure of a word carries important information about the meaning of the word.
Such structure is not taken into account by traditional word embeddings like Word2Vec, which
train a unique word embedding for every individual word.
This is especially significant for morphologically rich languages (German, Turkish) in which a
single word can have a large number of morphological forms, each of which might occur rarely,
thus making it hard to train good word embeddings.


fastText attempts to solve this by treating each word as the aggregation of its subwords.
For the sake of simplicity and language-independence, subwords are taken to be the character ngrams
of the word. The vector for a word is simply taken to be the sum of all vectors of its component char-ngrams.
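
As a rough illustration of this idea, the sketch below breaks a word into character ngrams and sums toy vectors for them. It assumes the `<` and `>` word-boundary markers used by fastText; the `char_ngrams` and `word_vector` helpers and the random toy vectors are hypothetical stand-ins, not part of Gensim's API.

```python
import random

def char_ngrams(word, min_n=3, max_n=6):
    """Return the character ngrams of `word`, wrapped in boundary markers."""
    wrapped = f"<{word}>"  # assumption: '<' and '>' mark the word boundaries
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

def word_vector(word, ngram_vectors, dim):
    """Sum the vectors of all known ngrams of `word` (toy illustration)."""
    total = [0.0] * dim
    for ngram in char_ngrams(word):
        vec = ngram_vectors.get(ngram)
        if vec is not None:  # ngrams never seen in "training" are skipped
            total = [t + v for t, v in zip(total, vec)]
    return total

# Toy ngram vectors: random values standing in for trained ones.
rng = random.Random(0)
dim = 4
ngram_vectors = {g: [rng.uniform(-1, 1) for _ in range(dim)]
                 for g in char_ngrams("night")}

trigrams = char_ngrams("night", min_n=3, max_n=3)
print(trigrams)

# "nights" was never seen, but it shares ngrams with "night",
# so a vector can still be assembled for it.
night_vec = word_vector("night", ngram_vectors, dim)
nights_vec = word_vector("nights", ngram_vectors, dim)
print(night_vec)
print(nights_vec)
```

In this toy setup, `nights` gets a vector purely from the ngrams it shares with `night`, which is the mechanism behind the OOV lookups shown later in the tutorial.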


According to a detailed comparison of Word2Vec and fastText in
[this notebook](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Word2Vec_FastText_Comparison.ipynb),
fastText does significantly better than the original Word2Vec on syntactic tasks,
especially when the training corpus is small. Word2Vec slightly outperforms fastText
on semantic tasks, though. The differences grow smaller as the size of the training corpus increases.


fastText can obtain vectors even for out-of-vocabulary (OOV) words by summing the vectors of their
component char-ngrams, provided at least one of those char-ngrams was present in the training data.




## Training models




For the following examples, we'll use the Lee Corpus (which you already have if you've installed Gensim) for training our model.






In [3]:
from pprint import pprint as print
from gensim.models.fasttext import FastText
from gensim.test.utils import datapath

# Path to the training corpus bundled with Gensim
corpus_file = datapath('lee_background.cor')

model = FastText(vector_size=100)

# build the vocabulary
model.build_vocab(corpus_file=corpus_file)

# train the model
model.train(
    corpus_file=corpus_file, epochs=model.epochs,
    total_examples=model.corpus_count, total_words=model.corpus_total_words,
)

print(model)

2023-09-28 11:37:22,946 : INFO : adding document #0 to Dictionary<0 unique tokens: []>
2023-09-28 11:37:22,948 : INFO : built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)
2023-09-28 11:37:22,954 : INFO : Dictionary lifecycle event {'msg': "built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)", 'datetime': '2023-09-28T11:37:22.954831', 'gensim': '4.3.2', 'python': '3.11.4 (main, Jun  9 2023, 07:59:55) [GCC 12.3.0]', 'platform': 'Linux-6.2.0-33-generic-x86_64-with-glibc2.37', 'event': 'created'}
2023-09-28 11:37:23,168 : INFO : FastText lifecycle event {'params': 'FastText<vocab=0, vector_size=100, alpha=0.025>', 'datetime': '2023-09-28T11:37:23.168619', 'gensim': '4.3.2', 'python': '3.11.4 (main, Jun  9 2023, 07:59:55) [GCC 12.3.0]', 'platform': 'Linux-6.2.0-33-generic-x86_64-with-glibc2.37', 'event': 'created'}

<gensim.models.fasttext.FastText object at 0x7fac98cc7110>


### Training hyperparameters




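Hyperparameters for training the model follow the same pattern as Word2Vec. FastText supports the following parameters from the original word2vec. They are listed here under the names used by the original fastText tool; in Gensim's `FastText` class, `model`, `loss`, and `threads` correspond to the `sg`, `hs`/`negative`, and `workers` arguments: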

- model: Training architecture. Allowed values: `cbow`, `skipgram` (Default `cbow`)
- vector_size: Dimensionality of vector embeddings to be learnt (Default 100)
- alpha: Initial learning rate (Default 0.025)
- window: Context window size (Default 5)
- min_count: Ignore words with number of occurrences below this (Default 5)
- loss: Training objective. Allowed values: `ns`, `hs`, `softmax` (Default `ns`)
- sample: Threshold for downsampling higher-frequency words (Default 0.001)
- negative: Number of negative words to sample, for `ns` (Default 5)
- epochs: Number of epochs (Default 5)
- sorted_vocab: Sort vocab by descending frequency (Default 1)
- threads: Number of threads to use (Default 12)


fastText also has three parameters of its own:

- min_n: min length of char ngrams (Default 3)
- max_n: max length of char ngrams (Default 6)
- bucket: number of buckets used for hashing ngrams (Default 2000000)


Parameters ``min_n`` and ``max_n`` control the lengths of the character ngrams that each word is broken down into during training and embedding lookup. If ``max_n`` is set to 0, or is less than ``min_n``, no character ngrams are used, and the model effectively reduces to Word2Vec.



To bound the memory requirements of the model being trained, a hashing function is used that maps ngrams to integers in 1 to K. For hashing these character sequences, the [Fowler-Noll-Vo hashing function](http://www.isthe.com/chongo/tech/comp/fnv) (FNV-1a variant) is employed.
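
To make the bucketing concrete, here is a minimal sketch of the standard 32-bit FNV-1a hash followed by a modulo into `bucket` slots. Gensim's internal hashing routine may differ in detail (for example in how non-ASCII bytes are handled, or in whether indices start at 0 or 1), so treat `fnv1a_32` and `ngram_bucket` as hypothetical helpers rather than the library's API.

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5  # 2166136261
FNV_PRIME_32 = 0x01000193         # 16777619

def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # keep the result within 32 bits
    return h

def ngram_bucket(ngram: str, bucket: int = 2000000) -> int:
    """Map an ngram to one of `bucket` slots (0-based in this sketch)."""
    return fnv1a_32(ngram.encode("utf-8")) % bucket

for ngram in ["<ni", "nig", "igh", "ght", "ht>"]:
    print(ngram, ngram_bucket(ngram))
```

Because the number of buckets is fixed, distinct ngrams can collide and share one vector; a larger `bucket` value reduces collisions at the cost of memory.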




**Note:** You can continue to train your model while using Gensim's native implementation of fastText.




## Saving/loading models




Models can be saved and loaded via the ``load`` and ``save`` methods, just like
any other model in Gensim.




In [4]:
# Save a model trained via Gensim's fastText implementation to temp.
import tempfile
import os
with tempfile.NamedTemporaryFile(prefix='saved_model_gensim-', delete=False) as tmp:
    model.save(tmp.name, separately=[])

# Load back the same model.
loaded_model = FastText.load(tmp.name)
print(loaded_model)

os.unlink(tmp.name)  # demonstration complete, don't need the temp file anymore

2023-09-28 11:37:31,304 : INFO : FastText lifecycle event {'fname_or_handle': '/tmp/saved_model_gensim-gx2m_wjw', 'separately': '[]', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2023-09-28T11:37:31.304884', 'gensim': '4.3.2', 'python': '3.11.4 (main, Jun  9 2023, 07:59:55) [GCC 12.3.0]', 'platform': 'Linux-6.2.0-33-generic-x86_64-with-glibc2.37', 'event': 'saving'}
2023-09-28 11:37:31,307 : INFO : storing np array 'vectors_ngrams' to /tmp/saved_model_gensim-gx2m_wjw.wv.vectors_ngrams.npy
2023-09-28 11:37:38,060 : INFO : not storing attribute buckets_word
2023-09-28 11:37:38,062 : INFO : not storing attribute vectors
2023-09-28 11:37:38,065 : INFO : not storing attribute cum_table
2023-09-28 11:37:38,106 : INFO : saved /tmp/saved_model_gensim-gx2m_wjw
2023-09-28 11:37:38,109 : INFO : loading FastText object from /tmp/saved_model_gensim-gx2m_wjw
2023-09-28 11:37:38,128 : INFO : loading wv recursively from /tmp/saved_model_gensim-gx2m_wjw.wv.* with mmap=None
2023-09-28 11:3

<gensim.models.fasttext.FastText object at 0x7fac8beb42d0>


The ``save_word2vec_format`` method is also available for fastText models, but it
discards all ngram vectors.
As a result, a model saved and reloaded this way behaves like a regular word2vec model.




## Word vector lookup


All the information needed to look up fastText word vectors (including for OOV words) is
contained in the model's ``wv`` attribute.

If you don't need to continue training your model, you can export and save this `.wv`
attribute and discard `model`, to save disk space and RAM.




In [5]:
wv = model.wv
print(wv)

#
# FastText models support vector lookups for out-of-vocabulary words by summing up character ngrams belonging to the word.
#
print('night' in wv.key_to_index)

<gensim.models.fasttext.FastTextKeyedVectors object at 0x7fac8bfb8b10>
True


In [6]:
print('nights' in wv.key_to_index)

False


In [7]:
print(wv['night'])

array([-0.12888013,  0.04867967, -0.24673456, -0.04103868,  0.05275234,
        0.31245694,  0.4401131 ,  0.5945194 ,  0.20483033, -0.33483908,
        0.08969966, -0.12318928, -0.3225757 ,  0.6259468 , -0.336348  ,
       -0.5068071 ,  0.15760842, -0.18644254, -0.40220895, -0.5493166 ,
       -0.4112685 ,  0.07552315, -0.5955141 , -0.09174105, -0.22124337,
       -0.30470148, -0.55637515, -0.08589523, -0.18355782,  0.145315  ,
       -0.28822622,  0.29130918,  0.83593965, -0.20667416,  0.19416878,
        0.2799458 ,  0.5025871 , -0.02762571, -0.3422564 , -0.27625018,
        0.5239989 , -0.4515748 ,  0.12707706, -0.25491595, -0.5585286 ,
       -0.39250594,  0.07981683,  0.21418914,  0.19030538, -0.05546196,
        0.33893386, -0.5403435 ,  0.30732256, -0.4247741 , -0.26652154,
       -0.2702304 , -0.26582745, -0.08564702,  0.0222022 , -0.3513158 ,
       -0.31648287, -0.40040338, -0.35052294,  0.37484387, -0.08553872,
        0.7157797 ,  0.05957315,  0.05660595,  0.37558448,  0.37

In [8]:
print(wv['nights'])

array([-0.11204359,  0.04297431, -0.21327995, -0.0352643 ,  0.04430781,
        0.26870054,  0.38157523,  0.5155037 ,  0.17746356, -0.290954  ,
        0.07923076, -0.10493508, -0.2797879 ,  0.5387193 , -0.29198322,
       -0.43849862,  0.13543466, -0.16058479, -0.34610537, -0.47520745,
       -0.35236043,  0.06411079, -0.514322  , -0.08066469, -0.18978927,
       -0.26179725, -0.4792625 , -0.07197589, -0.15831928,  0.12707758,
       -0.24708374,  0.2511851 ,  0.72039205, -0.17814907,  0.16783813,
        0.24126118,  0.43550962, -0.02385413, -0.29584044, -0.23919852,
        0.45178562, -0.38916302,  0.10925046, -0.21967076, -0.4832682 ,
       -0.33781338,  0.07183307,  0.18563451,  0.165774  , -0.04679913,
        0.29431117, -0.4669891 ,  0.26610854, -0.3668553 , -0.22980894,
       -0.23220178, -0.2315464 , -0.07219072,  0.02049286, -0.300834  ,
       -0.27233016, -0.3461843 , -0.30240023,  0.32326448, -0.07330327,
        0.6192912 ,  0.05157118,  0.04629685,  0.32418296,  0.32

## Similarity operations




Similarity operations work the same way as in Word2Vec. **Out-of-vocabulary words can also be used, provided they have at least one character ngram present in the training data.**
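
For reference, `wv.similarity` reports the cosine similarity between two word vectors. A minimal pure-Python sketch of that measure follows; the `cosine_similarity` helper and the toy vectors are illustrative only and are not Gensim's implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score close to 1, orthogonal ones score 0.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Since the measure ignores vector length, two words whose vectors are built from mostly the same ngrams end up with a similarity near 1.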




In [9]:
print("nights" in wv.key_to_index)

False


In [10]:
print("night" in wv.key_to_index)

True


In [11]:
print(wv.similarity("night", "nights"))

0.99999166


Syntactically similar words generally have high similarity in fastText models, since a large number of the component char-ngrams will be the same. As a result, fastText generally does better at syntactic tasks than Word2Vec. A detailed comparison is provided [here](Word2Vec_FastText_Comparison.ipynb).
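
The ngram overlap behind this effect can be inspected directly. The sketch below is independent of Gensim: it assumes ngram lengths 3 to 6 (the `min_n` and `max_n` defaults listed earlier) and `<`/`>` boundary markers, and the `ngram_set` helper is hypothetical.

```python
def ngram_set(word, min_n=3, max_n=6):
    """Set of character ngrams of `word`, including boundary markers."""
    wrapped = f"<{word}>"
    return {wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)}

night, nights = ngram_set("night"), ngram_set("nights")
shared = night & nights

print(f"'night': {len(night)} ngrams, 'nights': {len(nights)} ngrams, "
      f"shared: {len(shared)}")
print("only in 'night':", sorted(night - nights))
```

Most of the ngrams of `night` also occur in `nights`; only those touching the closing boundary marker differ, which is consistent with the very high similarity score above.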




### Other similarity operations

The example training corpus is a toy corpus, so the results below are not expected to be good; they serve as a proof of concept only.



In [12]:
print(wv.most_similar("nights"))

[('night', 0.9999918341636658),
 ('rights', 0.9999877214431763),
 ('flights', 0.9999875426292419),
 ('overnight', 0.9999867677688599),
 ('fighting', 0.9999852180480957),
 ('fighters', 0.9999850392341614),
 ('fight', 0.9999849796295166),
 ('entered', 0.9999848008155823),
 ('fighter', 0.9999847412109375),
 ('eight', 0.9999843239784241)]


In [13]:
print(wv.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant']))

0.99994105


In [14]:
print(wv.doesnt_match("breakfast cereal dinner lunch".split()))

'lunch'


In [15]:
print(wv.most_similar(positive=['baghdad', 'england'], negative=['london']))

[('capital,', 0.9996389746665955),
 ('find', 0.9996379017829895),
 ('findings', 0.999630868434906),
 ('field', 0.9996299147605896),
 ('finding', 0.9996282458305359),
 ('seekers.', 0.9996281862258911),
 ('abuse', 0.9996275305747986),
 ('had', 0.9996262192726135),
 ('storm', 0.9996260404586792),
 ('26-year-old', 0.9996227025985718)]


In [16]:
print(wv.evaluate_word_analogies(datapath('questions-words.txt')))

2023-09-28 11:37:55,168 : INFO : Evaluating word analogies for top 300000 words in the model on /home/pat/VSCode/gensim/.venv/lib/python3.11/site-packages/gensim/test/test_data/questions-words.txt
2023-09-28 11:37:55,280 : INFO : family: 0.0% (0/2)
2023-09-28 11:37:55,359 : INFO : gram3-comparative: 8.3% (1/12)
2023-09-28 11:37:55,386 : INFO : gram4-superlative: 33.3% (4/12)
2023-09-28 11:37:55,446 : INFO : gram5-present-participle: 45.0% (9/20)
2023-09-28 11:37:55,526 : INFO : gram6-nationality-adjective: 30.0% (6/20)
2023-09-28 11:37:55,642 : INFO : gram7-past-tense: 5.0% (1/20)
2023-09-28 11:37:55,675 : INFO : gram8-plural: 33.3% (4/12)
2023-09-28 11:37:55,685 : INFO : Quadruplets with out-of-vocabulary words: 99.5%
2023-09-28 11:37:55,688 : INFO : NB: analogies containing OOV words were skipped from evaluation! To change this behavior, use "dummy4unknown=True"
2023-09-28 11:37:55,690 : INFO : Total accuracy: 25.5% (25/98)


(0.25510204081632654,
 [{'correct': [], 'incorrect': [], 'section': 'capital-common-countries'},
  {'correct': [], 'incorrect': [], 'section': 'capital-world'},
  {'correct': [], 'incorrect': [], 'section': 'currency'},
  {'correct': [], 'incorrect': [], 'section': 'city-in-state'},
  {'correct': [],
   'incorrect': [('HE', 'SHE', 'HIS', 'HER'), ('HIS', 'HER', 'HE', 'SHE')],
   'section': 'family'},
  {'correct': [], 'incorrect': [], 'section': 'gram1-adjective-to-adverb'},
  {'correct': [], 'incorrect': [], 'section': 'gram2-opposite'},
  {'correct': [('LONG', 'LONGER', 'GREAT', 'GREATER')],
   'incorrect': [('GOOD', 'BETTER', 'GREAT', 'GREATER'),
                 ('GOOD', 'BETTER', 'LONG', 'LONGER'),
                 ('GOOD', 'BETTER', 'LOW', 'LOWER'),
                 ('GREAT', 'GREATER', 'LONG', 'LONGER'),
                 ('GREAT', 'GREATER', 'LOW', 'LOWER'),
                 ('GREAT', 'GREATER', 'GOOD', 'BETTER'),
                 ('LONG', 'LONGER', 'LOW', 'LOWER'),
             

### Word Mover's Distance

You'll need the optional ``POT`` library for this section; install it with ``pip install POT``.

Let's start with two sentences:



In [17]:
sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()

Remove their stopwords.




In [18]:
from gensim.parsing.preprocessing import STOPWORDS
sentence_obama = [w for w in sentence_obama if w not in STOPWORDS]
sentence_president = [w for w in sentence_president if w not in STOPWORDS]

Compute the Word Mover's Distance between the two sentences.



In [19]:
distance = wv.wmdistance(sentence_obama, sentence_president)
print(f"Word Movers Distance is {distance} (lower means closer)")

2023-09-28 11:37:56,240 : INFO : adding document #0 to Dictionary<0 unique tokens: []>
2023-09-28 11:37:56,242 : INFO : built Dictionary<8 unique tokens: ['illinois', 'media', 'obama', 'speaks', 'chicago']...> from 2 documents (total 8 corpus positions)
2023-09-28 11:37:56,245 : INFO : Dictionary lifecycle event {'msg': "built Dictionary<8 unique tokens: ['illinois', 'media', 'obama', 'speaks', 'chicago']...> from 2 documents (total 8 corpus positions)", 'datetime': '2023-09-28T11:37:56.245031', 'gensim': '4.3.2', 'python': '3.11.4 (main, Jun  9 2023, 07:59:55) [GCC 12.3.0]', 'platform': 'Linux-6.2.0-33-generic-x86_64-with-glibc2.37', 'event': 'created'}


'Word Movers Distance is 0.016112898508040444 (lower means closer)'


That's all! You've made it to the end of this tutorial.




In [20]:
import os
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# Display the fastText logo only if the image file is available locally.
logo_path = 'fasttext-logo-color-web.png'
if os.path.exists(logo_path):
    img = mpimg.imread(logo_path)
    imgplot = plt.imshow(img)
    _ = plt.axis('off')