# Using GenSim and Word2Vec

Word2Vec is a more recent model that embeds words in a lower-dimensional vector space using a shallow neural network. 

The result is a set of word-vectors where vectors close together in vector space have similar meanings based on context, and word-vectors distant from each other have differing meanings. 

For example, “strong” and “powerful” would be close together, while “strong” and “Paris” would be relatively far apart.

There are two versions of this model, and the Word2Vec class implements them both:

Skip-grams (SG)

Continuous-bag-of-words (CBOW)


The Word2Vec Skip-gram model, for example, takes in pairs (word1, word2) generated by moving a window across text data, and trains a 1-hidden-layer neural network on a synthetic task: given an input word, predict a probability distribution over the words near it. A virtual one-hot encoding of words goes through a ‘projection layer’ to the hidden layer; these projection weights are later interpreted as the word embeddings. So if the hidden layer has 300 neurons, this network will give us 300-dimensional word embeddings.
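To make the pair-generation step concrete, here is a minimal pure-Python sketch of sliding a window across a tokenized sentence. The function name `skipgram_pairs` and the fixed `window` argument are illustrative assumptions; this is not Gensim's internal implementation, which also applies details such as randomly shrinking the window.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token and its neighbors."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # left edge of the window
        hi = min(len(tokens), i + window + 1)   # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = ["the", "quick", "brown", "fox"]
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

Each pair is one training example: the first word is the network input, the second is a word the network should assign high probability to.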

Continuous-bag-of-words Word2vec is very similar to the skip-gram model. It is also a 1-hidden-layer neural network. The synthetic training task now uses the average of multiple input context words, rather than a single word as in skip-gram, to predict the center word. Again, the projection weights that turn one-hot words into averageable vectors, of the same width as the hidden layer, are interpreted as the word embeddings.
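The CBOW input can be sketched in the same spirit: collect the context words around each center word, look up their projection vectors, and average them. The names `cbow_examples`, `average_vectors` and the two-dimensional `toy_projection` table are made up for illustration; real projection weights are learned during training.

```python
def cbow_examples(tokens, window=2):
    """Return (context_words, center_word) training examples."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, center))
    return examples


def average_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Hypothetical projection weights: one row (embedding) per vocabulary word.
toy_projection = {
    "the": [1.0, 0.0],
    "quick": [0.0, 1.0],
    "brown": [1.0, 1.0],
    "fox": [3.0, 1.0],
}

examples = cbow_examples(["the", "quick", "brown", "fox"], window=1)
context, center = examples[1]
hidden = average_vectors([toy_projection[w] for w in context])
print(context, "->", center, "hidden input:", hidden)
```

Looking up a row of the table is equivalent to multiplying a one-hot vector by the projection matrix, which is why the one-hot encoding can stay "virtual".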

# Word2Vec Demo

To see what Word2Vec can do, let’s download a pre-trained model and play around with it. We will fetch the Word2Vec model trained on part of the Google News dataset, covering approximately 3 million words and phrases. Such a model can take hours to train, but since it’s already available, downloading and loading it with Gensim takes minutes.

In [0]:
import warnings
warnings.filterwarnings(action="ignore")

In [0]:
import gensim.downloader as api

In [0]:
wv = api.load('word2vec-google-news-300')



A common operation is to retrieve the vocabulary of a model. That is trivial:

In [0]:
for i, word in enumerate(wv.vocab):
    if i == 10:
        break
    print(i, "word = {}".format(word))

0 word = </s>
1 word = in
2 word = for
3 word = that
4 word = is
5 word = on
6 word = ##
7 word = The
8 word = with
9 word = said


We can easily obtain vectors for terms the model is familiar with:

In [0]:
vec_king = wv['king']

In [0]:
print(vec_king)

Unfortunately, the model is unable to infer vectors for unfamiliar words. This is one limitation of Word2Vec: if this limitation matters to you, check out the FastText model.

In [0]:
try:
    vec_cameroon = wv['cameroon']
except KeyError:
    print("The word 'cameroon' does not appear in this model")

The word 'cameroon' does not appear in this model


Moving on, Word2Vec supports several word similarity tasks out of the box. You can see how the similarity intuitively decreases as the words get less and less similar.

In [0]:
pairs = [
    ('car', 'minivan'),   # a minivan is a kind of car
    ('car', 'bicycle'),   # still a wheeled vehicle
    ('car', 'airplane'),  # ok, no wheels, but still a vehicle
    ('car', 'cereal'),    # ... and so on
    ('car', 'communism'),
]

In [0]:
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, wv.similarity(w1, w2)))

Print the 5 words most similar to “car” and “minivan” taken together

In [0]:
print(wv.most_similar(positive=["car", "minivan"], topn=5))

[('SUV', 0.853219211101532), ('vehicle', 0.8175784349441528), ('pickup_truck', 0.7763689160346985), ('Jeep', 0.7567334175109863), ('Ford_Explorer', 0.756571888923645)]
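When several positive words are given, the query is a combination of their vectors rather than two separate lookups. The sketch below shows one way this can work, assuming the query is the mean of the unit-normalized input vectors and candidates are ranked by cosine similarity (which is how I understand Gensim's behaviour, but treat it as an assumption). The toy vectors and the name `most_similar_sketch` are hypothetical.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def most_similar_sketch(vectors, positive, topn=2):
    """Rank words by cosine similarity to the mean of the positive words."""
    parts = [unit(vectors[w]) for w in positive]
    query = unit([sum(col) / len(parts) for col in zip(*parts)])
    scores = []
    for word, vec in vectors.items():
        if word in positive:          # the query words themselves are excluded
            continue
        score = sum(q * x for q, x in zip(query, unit(vec)))
        scores.append((word, score))
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:topn]


# Hypothetical 2-dimensional embeddings, chosen only for illustration.
toy_vectors = {
    "car": [1.0, 0.0],
    "minivan": [0.0, 1.0],
    "suv": [1.0, 1.0],
    "cereal": [-1.0, 0.0],
}
result = most_similar_sketch(toy_vectors, ["car", "minivan"], topn=2)
print(result)
```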


Which of the below does not belong in the sequence?

In [0]:
print(wv.doesnt_match(['fire', 'water', 'land', 'sea', 'air', 'car']))

car
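The odd-one-out query can be pictured as follows: average the unit-normalized vectors of all the words, then return the word whose vector is least similar to that average. This is a sketch under that assumption, with hypothetical toy vectors and a made-up function name `doesnt_match_sketch`; it is not the library's code.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def doesnt_match_sketch(vectors, words):
    """Return the word least similar to the mean of all the words."""
    units = {w: unit(vectors[w]) for w in words}
    mean = [sum(col) / len(units) for col in zip(*units.values())]
    # dot product with the mean vector serves as the similarity score
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Hypothetical embeddings: three "element" words point one way, "car" another.
toy_vectors = {
    "fire": [1.0, 0.1],
    "water": [0.9, 0.2],
    "air": [1.0, 0.0],
    "car": [0.0, 1.0],
}
odd_one = doesnt_match_sketch(toy_vectors, ["fire", "water", "air", "car"])
print(odd_one)
```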


Find the words most similar to a given word

In [0]:
print(wv.similar_by_word(word="car", topn=5))

[('vehicle', 0.7821096181869507), ('cars', 0.7423830032348633), ('SUV', 0.7160962820053101), ('minivan', 0.6907036304473877), ('truck', 0.6735789775848389)]


Compute cosine similarity between words

In [0]:
print(wv.cosine_similarities(wv["car"], [wv["vehicle"]]))

[0.78210956]
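For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal pure-Python version, using hypothetical three-dimensional toy vectors in place of the real 300-dimensional ones:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


toy_car = [1.0, 2.0, 0.0]
toy_vehicle = [2.0, 4.0, 0.0]   # same direction as toy_car, different length
toy_cereal = [0.0, 0.0, 3.0]    # orthogonal to toy_car

print(cosine_similarity(toy_car, toy_vehicle))
print(cosine_similarity(toy_car, toy_cereal))
```

Because the measure depends only on direction, scaling a vector does not change its similarity to other vectors.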


Get all words that are closer to w1 than w2 is to w1.

In [0]:
print(wv.words_closer_than("car", "SUV"))

['vehicle', 'cars']


# Training Your Own Model

To start, you’ll need some data for training the model. For the following examples, we’ll use the Lee Corpus (which you already have if you’ve installed gensim).

This corpus is small enough to fit entirely in memory, but we’ll implement a memory-friendly iterator that reads it line-by-line to demonstrate how you would handle a larger corpus.

In [0]:
from gensim.test.utils import datapath
from gensim import utils

In [0]:
class MyCorpus(object):
    """An iterator that yields sentences (lists of str)."""

    def __iter__(self):
        corpus_path = datapath('lee_background.cor')
        # open the file in a context manager so it is closed when iteration ends
        with open(corpus_path) as fin:
            for line in fin:
                # assume there's one document per line, tokens separated by whitespace
                yield utils.simple_preprocess(line)

If we wanted to do any custom preprocessing (e.g. decode a non-standard encoding, lowercase, remove numbers, extract named entities), all of it could be done inside the MyCorpus iterator, and word2vec wouldn’t need to know. All that is required is that the input yields one sentence (a list of utf8 words) after another.
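As a sketch of what such custom preprocessing might look like, the iterator below lowercases each line and keeps only runs of ASCII letters, which drops digits and punctuation. The class name `PreprocessedCorpus`, the regular expression and the demo file are assumptions made for illustration, and the example uses only the standard library.

```python
import os
import re
import tempfile


class PreprocessedCorpus:
    """Stream a text file line by line, yielding one list of tokens per line."""

    def __init__(self, path, encoding="utf-8"):
        self.path = path
        self.encoding = encoding

    def __iter__(self):
        # Re-opening the file on every call makes the corpus restartable.
        with open(self.path, encoding=self.encoding) as fin:
            for line in fin:
                tokens = re.findall(r"[a-z]+", line.lower())
                if tokens:               # skip lines with no usable tokens
                    yield tokens


# Write a tiny demo file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Hello World 2020\n\nThe 3 cats sat.\n")
    demo_path = tmp.name

corpus = PreprocessedCorpus(demo_path)
first_pass = list(corpus)
second_pass = list(corpus)   # a second iteration yields the same sentences
os.remove(demo_path)
print(first_pass)
```

The corpus is an object with `__iter__` rather than a one-shot generator because training passes over the data more than once (a vocabulary scan, then the training epochs).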

Let’s go ahead and train a model on our corpus. Don’t worry about the training parameters much for now; we’ll revisit them later.

In [0]:
import gensim.models

In [0]:
sentences = MyCorpus()
model = gensim.models.Word2Vec(sentences=sentences)

Once we have our model, we can use it in the same way as in the demo above.

The main part of the model is model.wv, where “wv” stands for “word vectors”.

In [0]:
vec_king = model.wv['king']

Retrieving the vocabulary works the same way:

In [0]:
for i, word in enumerate(model.wv.vocab):
    if i == 10:
        break
    print(word)

# Storing and Loading Models

You’ll notice that training non-trivial models can take time. Once you’ve trained your model and it works as expected, you can save it to disk. That way, you don’t have to spend time training it all over again later.

You can store/load models using the standard gensim methods:

In [0]:
import tempfile

In [0]:
with tempfile.NamedTemporaryFile(prefix='gensim-model-', delete=False) as tmp:
    temporary_filepath = tmp.name
    model.save(temporary_filepath)
    #
    # The model is now safely stored in the filepath.
    # You can copy it to other machines, share it with others, etc.
    #
    # To load a saved model:
    #
    new_model = gensim.models.Word2Vec.load(temporary_filepath)

Saving and loading use pickle internally, optionally mmap’ing the model’s large internal NumPy matrices into virtual memory directly from disk files, for inter-process memory sharing.

In addition, you can load models created by the original C tool, using both its text and binary formats:

In [0]:
model = gensim.models.KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)
# using gzipped/bz2 input works too, no need to unzip
model = gensim.models.KeyedVectors.load_word2vec_format('/tmp/vectors.bin.gz', binary=True)

# Training Parameters

Word2Vec accepts several parameters that affect both training speed and quality.


# min_count

min_count is for pruning the internal dictionary. Words that appear only once or twice in a billion-word corpus are probably uninteresting typos and garbage. In addition, there’s not enough data to train meaningfully on those words, so it’s best to ignore them:

default value of min_count=5

In [0]:
model = gensim.models.Word2Vec(sentences, min_count=10)
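The effect of min_count can be illustrated without Gensim: count how often each token occurs and keep only the words that reach the threshold. The helper name `prune_vocabulary` and the toy sentences are hypothetical.

```python
from collections import Counter


def prune_vocabulary(sentences, min_count=5):
    """Return {word: count} for words occurring at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


toy_sentences = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "dog", "sat"],
]
kept = prune_vocabulary(toy_sentences, min_count=2)
print(kept)
```

Words below the threshold ("ran" and "dog" here) get no vector at all, so looking them up in the trained model would raise a KeyError.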


# size

size is the number of dimensions (N) of the N-dimensional space that gensim Word2Vec maps the words onto.

Bigger size values require more training data, but can lead to better (more accurate) models. Reasonable values are in the tens to hundreds.

default value of size=100


In [0]:
model = gensim.models.Word2Vec(sentences, size=200)


# workers

workers, the last of the major parameters, is for training parallelization, to speed up training:

default value of workers=3

The workers parameter only has an effect if you have Cython installed. Without Cython, you’ll only be able to use one core because of the GIL (and word2vec training will be miserably slow).

In [0]:
model = gensim.models.Word2Vec(sentences, workers=4)

# Memory



At its core, word2vec model parameters are stored as matrices (NumPy arrays). Each array is #vocabulary (controlled by min_count parameter) times #size (size parameter) of floats (single precision aka 4 bytes).

Three such matrices are held in RAM (work is underway to reduce that number to two, or even one). So if your input contains 100,000 unique words, and you asked for layer size=200, the model will require approx. 100,000*200*4*3 bytes = ~229MB.
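The arithmetic above is easy to reproduce. The helper below is a back-of-the-envelope sketch; the function name and its default arguments (three matrices, 4-byte floats) simply restate the assumptions in the text.

```python
def estimate_word2vec_bytes(vocab_size, vector_size, n_matrices=3, bytes_per_float=4):
    """Rough size of the weight matrices, ignoring the vocabulary structures."""
    return vocab_size * vector_size * bytes_per_float * n_matrices


estimate = estimate_word2vec_bytes(100_000, 200)
print(estimate, "bytes")
print(round(estimate / 1024 ** 2, 1), "MiB")
```

The figure of roughly 229MB corresponds to dividing by 1024 squared, i.e. it is expressed in mebibytes.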

There’s a little extra memory needed for storing the vocabulary tree (100,000 words would take a few megabytes), but unless your words are extremely loooong strings, memory footprint will be dominated by the three matrices above.


# Evaluating

Word2Vec training is an unsupervised task, so there’s no good way to objectively evaluate the result. Evaluation depends on your end application.

Google has released their testing set of about 20,000 syntactic and semantic test examples, following the “A is to B as C is to D” task. It is provided in the ‘datasets’ folder.

For example, a syntactic analogy of comparative type is bad:worse;good:?. There are a total of 9 types of syntactic comparisons in the dataset, such as plural nouns and nouns of opposite meaning.

The semantic questions contain five types of semantic analogies, such as capital cities (Paris:France;Tokyo:?) or family members (brother:sister;dad:?).
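To give a feel for the test set, the sketch below parses a few lines in the layout used by the analogy file: a line starting with a colon names a section, and every other line holds four words, “A B C D”. The sample lines and the function name `parse_analogy_lines` are illustrative assumptions, not an excerpt from the real file.

```python
def parse_analogy_lines(lines):
    """Group four-word analogy lines under their ': section' headers."""
    sections = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()       # start a new section
            sections[current] = []
        else:
            a, b, c, d = line.split()        # "A is to B as C is to D"
            sections.setdefault(current, []).append((a, b, c, d))
    return sections


# Hypothetical sample in the same layout as the evaluation file.
sample = """\
: capital-common-countries
Paris France Tokyo Japan
: family
brother sister dad mom
""".splitlines()

sections = parse_analogy_lines(sample)
for name, questions in sections.items():
    print(name, questions)
```

An analogy counts as solved when the model, given A, B and C, ranks D as the best answer.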

Gensim supports the same evaluation set, in exactly the same format:

In [0]:
model.accuracy('./datasets/questions-words.txt')

The accuracy method takes an optional parameter restrict_vocab, which limits which test examples are to be considered.

In the December 2016 release of Gensim we added a better way to evaluate semantic similarity.

By default it uses the academic dataset WS-353, but you can create a dataset specific to your business based on it. It contains word pairs together with human-assigned similarity judgments. It measures the relatedness or co-occurrence of two words. For example, ‘coast’ and ‘shore’ are very similar as they appear in the same context. At the same time ‘clothes’ and ‘closet’ are less similar because they are related but not interchangeable.
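A common way to score a model against such human judgments is rank correlation: if the model orders the word pairs the same way people do, the correlation is close to 1. The sketch below computes Spearman's rho for data without ties. The scores are invented for illustration, and the helper names are hypothetical; in practice a statistics library would handle ties properly.

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_no_ties(xs, ys):
    """Spearman rank correlation via the classic no-ties formula."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical scores for four word pairs.
human_scores = [9.1, 7.5, 3.0, 1.2]      # human similarity judgments
model_scores = [0.80, 0.85, 0.30, 0.10]  # model cosine similarities

rho = spearman_no_ties(human_scores, model_scores)
print(rho)
```

Here the model swaps the order of the two most similar pairs, so the correlation is high but not perfect.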

# Online training / Resuming training

Advanced users can load a model and continue training it with more sentences and new vocabulary words:

In [0]:
model = gensim.models.Word2Vec.load(temporary_filepath)

In [0]:
more_sentences = [
    ['Advanced', 'users', 'can', 'load', 'a', 'model',
     'and', 'continue', 'training', 'it', 'with', 'more', 'sentences']
]
model.build_vocab(more_sentences, update=True)

In [0]:
model.train(more_sentences, total_examples=model.corpus_count, epochs=model.iter)

(27, 65)

In [0]:
# cleaning up temporary file
import os
os.remove(temporary_filepath)

You may need to tweak the total_words parameter to train(), depending on what learning rate decay you want to simulate.
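The learning rate decay mentioned here can be pictured as a straight line from a starting rate down to a minimum rate as training progresses. The sketch assumes linear decay and uses 0.025 and 0.0001, which are Gensim's default alpha and min_alpha; the function name `linear_alpha` and the `progress` argument (fraction of the expected work already done) are hypothetical.

```python
def linear_alpha(progress, start_alpha=0.025, end_alpha=0.0001):
    """Learning rate after a given fraction (0.0 to 1.0) of training."""
    progress = min(max(progress, 0.0), 1.0)   # clamp to the valid range
    return start_alpha - (start_alpha - end_alpha) * progress


for fraction in (0.0, 0.5, 1.0):
    print(fraction, linear_alpha(fraction))
```

Under this picture, the expected total amount of work determines how fast `progress` advances, so overstating it keeps the learning rate higher for longer and understating it makes the rate hit the minimum early.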

Note that it’s not possible to resume training with models generated by the C tool and loaded with KeyedVectors.load_word2vec_format(). You can still use them for querying/similarity, but information vital for training (the vocab tree) is missing there.

# Training Loss Computation

The parameter compute_loss can be used to toggle computation of loss while training the Word2Vec model. The computed loss is stored in the model attribute running_training_loss and can be retrieved using the function get_latest_training_loss as follows:

In [0]:
# instantiating and training the Word2Vec model
model_with_loss = gensim.models.Word2Vec(
    sentences,
    min_count=1,
    compute_loss=True,
    hs=0,
    sg=1,
    seed=42
)

In [0]:
# getting the training loss value
training_loss = model_with_loss.get_latest_training_loss()
print(training_loss)

1636602.625


# Benchmarks



Let’s run some benchmarks to see the effect of the training loss computation code on training time.

We’ll use the following data for the benchmarks:

Lee Background corpus: included in gensim’s test data

Text8 corpus. To demonstrate the effect of corpus size, we’ll look at the first 1MB, 10MB, 50MB of the corpus, as well as the entire thing.


In [0]:
import io
import os

import gensim.models.word2vec
import gensim.downloader as api
import smart_open



In [0]:
def head(path, size):
    with smart_open.open(path) as fin:
        return io.StringIO(fin.read(size))


def generate_input_data():
    lee_path = datapath('lee_background.cor')
    ls = gensim.models.word2vec.LineSentence(lee_path)
    ls.name = '25kB'
    yield ls

    text8_path = api.load('text8').fn
    labels = ('1MB', '10MB', '50MB', '100MB')
    sizes = (1024 ** 2, 10 * 1024 ** 2, 50 * 1024 ** 2, 100 * 1024 ** 2)
    for l, s in zip(labels, sizes):
        ls = gensim.models.word2vec.LineSentence(head(text8_path, s))
        ls.name = l
        yield ls

In [0]:
input_data = list(generate_input_data())



We now compare the training time taken for different combinations of input data and model training parameters like hs and sg.

For each combination, we repeat the test several times to obtain the mean and standard deviation of the test duration.

In [0]:
import logging
import time

import numpy as np
import pandas as pd

# Temporarily reduce logging verbosity
logging.root.level = logging.ERROR

train_time_values = []
seed_val = 42
sg_values = [0, 1]
hs_values = [0, 1]

fast = True
if fast:
    input_data_subset = input_data[:3]
else:
    input_data_subset = input_data


for data in input_data_subset:
    for sg_val in sg_values:
        for hs_val in hs_values:
            for loss_flag in [True, False]:
                time_taken_list = []
                for i in range(3):
                    start_time = time.time()
                    w2v_model = gensim.models.Word2Vec(
                        data,
                        compute_loss=loss_flag,
                        sg=sg_val,
                        hs=hs_val,
                        seed=seed_val,
                    )
                    time_taken_list.append(time.time() - start_time)

                time_taken_list = np.array(time_taken_list)
                time_mean = np.mean(time_taken_list)
                time_std = np.std(time_taken_list)

                model_result = {
                    'train_data': data.name,
                    'compute_loss': loss_flag,
                    'sg': sg_val,
                    'hs': hs_val,
                    'train_time_mean': time_mean,
                    'train_time_std': time_std,
                }
                print("Word2vec model #%i: %s" % (len(train_time_values), model_result))
                train_time_values.append(model_result)

train_times_table = pd.DataFrame(train_time_values)
train_times_table = train_times_table.sort_values(
    by=['train_data', 'sg', 'hs', 'compute_loss'],
    ascending=[False, False, True, False],
)
print(train_times_table)
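The repeat-and-summarize pattern used in the loop above could be factored into a small helper. This is only a sketch with a hypothetical name, `time_repeated`; it uses the standard library's `statistics.pstdev`, which like `np.std` computes the population standard deviation.

```python
import statistics
import time


def time_repeated(func, repeats=3):
    """Call func several times; return (mean, std) of the durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()   # monotonic clock, suited to timing
        func()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.pstdev(durations)


# Stand-in workload; in the benchmark this would be a model training call.
mean_s, std_s = time_repeated(lambda: sum(range(100_000)), repeats=3)
print("mean %.6fs, std %.6fs" % (mean_s, std_s))
```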

# Production Pipeline Word2Vec

Suppose we still want more performance in production.

One good way is to cache the results of similarity queries in a dictionary.

The next time the same query word arrives, we look it up in the dictionary first.

On a hit, we return the result directly from the dictionary.

Otherwise, we query the model and cache the result so that the next lookup doesn’t miss.

In [0]:
import time
words = ['voted', 'few', 'their', 'around']

without caching

In [0]:
start = time.time()
for word in words:
    result = wv.most_similar(word)
    print(result)
end = time.time()
print(end - start)


In [0]:
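# with caching
most_similars_precalc = {}  # maps a query word to its cached most_similar() result

start = time.time()
for word in words:
    if word in most_similars_precalc:
        # cache hit: reuse the stored result
        result = most_similars_precalc[word]
    else:
        # cache miss: query the model and remember the result
        result = wv.most_similar(word)
        most_similars_precalc[word] = result
    print(result)

end = time.time()
print(end - start)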
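The same idea can be expressed with the standard library's `functools.lru_cache`, which also bounds the cache size. In this sketch `slow_most_similar` is a hypothetical stand-in for an expensive call such as `wv.most_similar(word)`, and `call_log` exists only to show how often the expensive function actually runs.

```python
from functools import lru_cache

call_log = []  # records every call that reaches the expensive function


def slow_most_similar(word):
    """Hypothetical stand-in for an expensive similarity query."""
    call_log.append(word)
    # return a tuple so callers cannot accidentally mutate the cached value
    return ((word + "_neighbor", 0.9),)


@lru_cache(maxsize=10_000)
def cached_most_similar(word):
    return slow_most_similar(word)


for word in ["voted", "few", "voted", "voted"]:
    cached_most_similar(word)

print(call_log)
print(cached_most_similar.cache_info())
```

Only the first query for each distinct word reaches the expensive function; repeated queries are served from the cache, and the least recently used entries are evicted once `maxsize` is reached.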