## Word2Vec Model

Word2vec learns relationships between words automatically. The output is a set of vectors, one per word, with remarkable linear relationships that allow us to do things like:

vec(“king”) - vec(“man”) + vec(“woman”) ≈ vec(“queen”)

vec(“Montreal Canadiens”) - vec(“Montreal”) + vec(“Toronto”) ≈ vec(“Toronto Maple Leafs”)

Word2vec is very useful in automatic text tagging, recommender systems, and machine translation.
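
The vector arithmetic in the analogies above can be tried out on a handful of hand-made vectors. The sketch below is only an illustration: the 2-D `toy_vectors` and the helper names `cosine` and `analogy` are hypothetical, and real Word2vec embeddings are learned and have hundreds of dimensions.

```python
import math

# Hypothetical 2-D toy embeddings (axis 0 ~ "royalty", axis 1 ~ "maleness").
# These values are made up for illustration; they are not learned vectors.
toy_vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man": [0.1, 0.9],
    "woman": [0.1, 0.1],
    "prince": [0.8, 0.8],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, and return
    the closest remaining word by cosine similarity."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Skip the query words themselves when searching for the answer.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


best = analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(best)
```

With these toy values the nearest remaining word to vec(“king”) - vec(“man”) + vec(“woman”) is “queen”.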

Bag-of-words models are surprisingly effective, but have several weaknesses.

First, they lose all information about word order: “John likes Mary” and “Mary likes John” correspond to identical vectors. Bag-of-n-grams models offer a partial solution: they represent documents as fixed-length vectors over word phrases of length n, which captures local word order, but they suffer from data sparsity and high dimensionality.
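
The word-order problem is easy to reproduce. The following sketch (the helper name `bag_of_ngrams` is hypothetical) counts n-grams with the standard library and shows that the two example sentences are indistinguishable as bags of words but differ as bags of bigrams.

```python
from collections import Counter


def bag_of_ngrams(text, n=1):
    """Count the n-grams (as tuples of lowercase tokens) in a sentence."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


a = "John likes Mary"
b = "Mary likes John"

# Unigram counts ignore order, so the two sentences look identical.
same_unigrams = bag_of_ngrams(a, 1) == bag_of_ngrams(b, 1)
# Bigram counts keep local order, so the sentences now differ.
same_bigrams = bag_of_ngrams(a, 2) == bag_of_ngrams(b, 2)

print(same_unigrams, same_bigrams)
```

The price of the bigram representation is a much larger and sparser feature space, which is the sparsity problem mentioned above.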

Second, bag-of-words models do not attempt to learn the meaning of the underlying words, and as a consequence, the distance between vectors doesn’t always reflect the difference in meaning. The Word2Vec model addresses this second problem.

### Introducing: the Word2Vec Model
Word2Vec is a more recent model that embeds words in a lower-dimensional vector space using a shallow neural network. The result is a set of word-vectors where vectors close together in vector space have similar meanings based on context, and word-vectors distant from each other have differing meanings. For example, “strong” and “powerful” would be close together, while “strong” and “Paris” would be relatively far apart.

There are two versions of this model, and the Word2Vec class implements them both:

1. Skip-grams (SG)

2. Continuous-bag-of-words (CBOW)

The Word2Vec skip-gram model, for example, takes in pairs (word1, word2) generated by moving a window across text data, and trains a 1-hidden-layer neural network on a synthetic task: given an input word, predict a probability distribution over the words that appear near it. A virtual one-hot encoding of words goes through a ‘projection layer’ to the hidden layer; these projection weights are later interpreted as the word embeddings. So if the hidden layer has 300 neurons, this network will give us 300-dimensional word embeddings.
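
To make the “moving window” concrete, here is a minimal sketch of how (word1, word2) training pairs could be generated. The function name `skipgram_pairs` and the fixed `window` argument are assumptions for illustration; actual implementations add refinements such as subsampling of frequent words that are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window`
    positions of each center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the quick brown fox".split()
pairs = skipgram_pairs(tokens, window=1)
for pair in pairs:
    print(pair)
```

Each pair becomes one training example: the first word is the network input and the second is the nearby word it should assign high probability to.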

Continuous-bag-of-words Word2vec is very similar to the skip-gram model. It is also a 1-hidden-layer neural network. The synthetic training task now uses the average of multiple input context words, rather than a single word as in skip-gram, to predict the center word. Again, the projection weights that turn one-hot words into averageable vectors, of the same width as the hidden layer, are interpreted as the word embeddings.
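
The CBOW input step can be sketched in the same style. Everything named below (`cbow_examples`, `average`, and the 2-D `toy_projection` table) is hypothetical; the point is only to show context words being looked up and averaged into one hidden-layer-sized vector that is then used to predict the center word.

```python
def cbow_examples(tokens, window=2):
    """Yield (context_words, center_word) training examples."""
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            yield context, center


def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Made-up projection weights: one 2-D vector per vocabulary word.
toy_projection = {
    "the": [1.0, 0.0],
    "quick": [0.0, 1.0],
    "brown": [1.0, 1.0],
    "fox": [0.0, 0.0],
}

examples = list(cbow_examples("the quick brown fox".split(), window=1))
context, center = examples[1]
# The averaged context vector is what the network uses to predict `center`.
hidden = average([toy_projection[word] for word in context])
print(context, center, hidden)
```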

In [9]:
from gensim.test.utils import datapath
from gensim import utils
import gensim.models


In [10]:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')

In [2]:
# retrieve the vocabulary of a model
# (wv.vocab is the gensim 3.x API; gensim 4+ exposes wv.index_to_key instead)
for i, word in enumerate(wv.vocab):
    if i == 10:
        break
    print(word)

</s>
in
for
that
is
on
##
The
with
said


In [3]:
# obtain vectors for terms in the model
vec_king = wv['king']

Similarity intuitively decreases as the words get less and less similar.

In [4]:
pairs = [
    ('car', 'minivan'),   # a minivan is a kind of car
    ('car', 'bicycle'),   # still a wheeled vehicle
    ('car', 'airplane'),  # ok, no wheels, but still a vehicle
    ('car', 'cereal'),    # ... and so on
    ('car', 'communism'),
]
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, wv.similarity(w1, w2)))

'car'	'minivan'	0.69
'car'	'bicycle'	0.54
'car'	'airplane'	0.42
'car'	'cereal'	0.14
'car'	'communism'	0.06


In [7]:
#most similar words to “car” or “minivan”
print(wv.most_similar(positive=['car', 'minivan'], topn=5))

#does not belong in the sequence
print(wv.doesnt_match(['fire', 'water', 'land', 'sea', 'air', 'car']))

[('SUV', 0.853219211101532), ('vehicle', 0.8175784349441528), ('pickup_truck', 0.7763689160346985), ('Jeep', 0.7567334175109863), ('Ford_Explorer', 0.756571888923645)]
car
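
One plausible way to think about an odd-one-out query is to find the word whose direction agrees least with the average direction of the group. The sketch below illustrates that idea on made-up 2-D vectors; `unit`, `odd_one_out`, and `toy_vectors` are hypothetical names, and this is an illustration of the concept rather than a copy of gensim's code.

```python
import math


def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def odd_one_out(words, vectors):
    """Return the word least aligned with the mean direction of the group."""
    units = {word: unit(vectors[word]) for word in words}
    mean = unit([sum(column) / len(words) for column in zip(*units.values())])
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Made-up vectors: three "element" words point one way, "car" another.
toy_vectors = {
    "fire": [1.0, 0.1],
    "water": [0.9, 0.2],
    "air": [1.0, 0.0],
    "car": [0.1, 1.0],
}

odd = odd_one_out(["fire", "water", "air", "car"], toy_vectors)
print(odd)
```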


### Training the Model

In [8]:
class MyCorpus(object):
    """An iterator that yields sentences (lists of str)."""

    def __iter__(self):
        corpus_path = datapath('lee_background.cor')
        # use a context manager so the file is closed after each pass
        with open(corpus_path) as fin:
            for line in fin:
                # assume there's one document per line, tokens separated by whitespace
                yield utils.simple_preprocess(line)

In [11]:
sentences = MyCorpus()
model = gensim.models.Word2Vec(sentences=sentences)

In [13]:
vec_king = model.wv['king']

for i, word in enumerate(model.wv.vocab):
    if i == 10:
        break
    print(word)

hundreds
of
people
have
been
forced
to
their
homes
in


### Training Parameters

#### min_count
min_count is for pruning the internal dictionary. Words that appear only once or twice in a billion-word corpus are probably uninteresting typos and garbage. In addition, there’s not enough data to train meaningfully on those words, so it’s best to ignore them.

The default value is min_count=5.

In [14]:
model = gensim.models.Word2Vec(sentences, min_count=10)
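
The effect of this kind of pruning can be imitated with a plain frequency count. The helper `prune_vocabulary` and the toy corpus below are hypothetical; the sketch only shows words below the threshold being dropped before training.

```python
from collections import Counter


def prune_vocabulary(sentences, min_count=5):
    """Keep only the words that occur at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: count for word, count in counts.items() if count >= min_count}


toy_corpus = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "dog", "sat"],
]

vocab = prune_vocabulary(toy_corpus, min_count=2)
print(vocab)
```

In this toy corpus “ran” and “dog” occur only once, so they fall below the threshold and are dropped.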

#### size
size is the number of dimensions (N) of the N-dimensional space that gensim Word2Vec maps the words onto.

Bigger size values require more training data, but can lead to better (more accurate) models. Reasonable values are in the tens to hundreds.

In [15]:
# default value of size=100 (gensim 4+ renames this parameter to vector_size)
model = gensim.models.Word2Vec(sentences, size=200)

#### workers
workers, the last of the major parameters, is for training parallelization, to speed up training:

In [16]:
# default value of workers=3 (tutorial says 1...)
model = gensim.models.Word2Vec(sentences, workers=4)

#### Memory
At its core, word2vec model parameters are stored as matrices (NumPy arrays). Each array has #vocabulary rows (controlled by the min_count parameter) and #size columns (the size parameter) of single-precision floats (4 bytes each).

Three such matrices are held in RAM (work is underway to reduce that number to two, or even one). So if your input contains 100,000 unique words, and you asked for layer size=200, the model will require approximately 100,000 × 200 × 4 × 3 bytes = 240,000,000 bytes, or about 229 MiB.
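
The same back-of-the-envelope estimate can be wrapped in a small helper to try other vocabulary and vector sizes. The function name `estimate_model_bytes` and its defaults are assumptions that mirror the figures in the text (three matrices, 4-byte floats); actual memory use also includes the vocabulary structures themselves.

```python
def estimate_model_bytes(vocab_size, vector_size, num_matrices=3, bytes_per_float=4):
    """Rough size of the weight matrices: rows x columns x bytes x matrices."""
    return vocab_size * vector_size * bytes_per_float * num_matrices


total_bytes = estimate_model_bytes(vocab_size=100_000, vector_size=200)
total_mib = total_bytes / 1024 ** 2
print(total_bytes, round(total_mib))
```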

#### Evaluating
Word2Vec training is an unsupervised task, so there’s no good way to objectively evaluate the result. Evaluation depends on your end application.

Google has released their testing set of about 20,000 syntactic and semantic test examples, following the “A is to B as C is to D” task. It is provided in the ‘datasets’ folder.

For example, a syntactic analogy of the comparative type is bad:worse;good:?. There are a total of 9 types of syntactic comparisons in the dataset, such as plural nouns and nouns of opposite meaning.

The semantic questions contain five types of semantic analogies, such as capital cities (Paris:France;Tokyo:?) or family members (brother:sister;dad:?).

In [None]:
# gensim 3.x call; newer releases expose this as model.wv.evaluate_word_analogies(...)
model.accuracy('./datasets/questions-words.txt')
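
For orientation, the sketch below parses a few lines in the “A is to B as C is to D” layout. It assumes the file uses a line starting with a colon to open each section, followed by lines of four whitespace-separated words. The section names and sample lines are made up from the examples above, and `parse_analogies` is a hypothetical helper.

```python
def parse_analogies(lines):
    """Group four-word analogy lines under their ': section' headers."""
    sections = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            # A new section header, e.g. ": capitals"
            current = line[1:].strip()
            sections[current] = []
        else:
            a, b, c, d = line.split()
            sections.setdefault(current, []).append((a, b, c, d))
    return sections


# Hypothetical sample in the assumed layout (not copied from the real file).
sample = [
    ": capitals",
    "Paris France Tokyo Japan",
    ": comparative",
    "bad worse good better",
]

sections = parse_analogies(sample)
for name, questions in sections.items():
    print(name, questions)
```

An evaluation would then check, for each tuple, whether the nearest word to vec(B) - vec(A) + vec(C) is D.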

### Training Loss Computation

The parameter compute_loss can be used to toggle computation of loss while training the Word2Vec model. The computed loss is stored in the model attribute running_training_loss and can be retrieved using the function get_latest_training_loss as follows:

In [18]:
# instantiating and training the Word2Vec model
model_with_loss = gensim.models.Word2Vec(
    sentences,
    min_count=1,
    compute_loss=True,
    hs=0,
    sg=1,
    seed=42
)

# getting the training loss value
training_loss = model_with_loss.get_latest_training_loss()
print(training_loss)

1360505.375
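
If the reported value is a running total that keeps growing across epochs (which is an assumption here, suggested by the attribute name running_training_loss), per-epoch losses can be recovered by differencing successive readings. The helper `per_epoch_losses` and the readings below are hypothetical and do not come from the model above.

```python
def per_epoch_losses(cumulative_readings):
    """Convert a sequence of running totals into per-epoch increments."""
    previous = 0.0
    deltas = []
    for value in cumulative_readings:
        deltas.append(value - previous)
        previous = value
    return deltas


# Made-up running totals, one reading taken at the end of each epoch.
readings = [500.0, 900.0, 1200.0]
deltas = per_epoch_losses(readings)
print(deltas)
```

A shrinking increment from one epoch to the next is what one would hope to see while training is still making progress.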


### Benchmarks
Let’s run some benchmarks to see the effect of the training loss computation code on training time.

We’ll use the following data for the benchmarks:

1. Lee Background corpus: included in gensim’s test data

2. Text8 corpus. To demonstrate the effect of corpus size, we’ll look at the first 1MB, 10MB, 50MB of the corpus, as well as the entire thing.

In [19]:
import io
import os

import gensim.models.word2vec
import gensim.downloader as api
import smart_open


def head(path, size):
    with smart_open.open(path) as fin:
        return io.StringIO(fin.read(size))


def generate_input_data():
    lee_path = datapath('lee_background.cor')
    ls = gensim.models.word2vec.LineSentence(lee_path)
    ls.name = '25kB'
    yield ls

    text8_path = api.load('text8').fn
    labels = ('1MB', '10MB', '50MB', '100MB')
    sizes = (1024 ** 2, 10 * 1024 ** 2, 50 * 1024 ** 2, 100 * 1024 ** 2)
    for l, s in zip(labels, sizes):
        ls = gensim.models.word2vec.LineSentence(head(text8_path, s))
        ls.name = l
        yield ls


input_data = list(generate_input_data())

We now compare the training time taken for different combinations of input data and model training parameters like hs and sg.

For each combination, we repeat the test several times to obtain the mean and standard deviation of the test duration.

In [22]:
import time
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import numpy as np
import pandas as pd

# Temporarily reduce logging verbosity
logging.root.level = logging.ERROR

train_time_values = []
seed_val = 42
sg_values = [0, 1]
hs_values = [0, 1]

fast = True
if fast:
    input_data_subset = input_data[:3]
else:
    input_data_subset = input_data


for data in input_data_subset:
    for sg_val in sg_values:
        for hs_val in hs_values:
            for loss_flag in [True, False]:
                time_taken_list = []
                for i in range(3):
                    start_time = time.time()
                    w2v_model = gensim.models.Word2Vec(
                        data,
                        compute_loss=loss_flag,
                        sg=sg_val,
                        hs=hs_val,
                        seed=seed_val,
                    )
                    time_taken_list.append(time.time() - start_time)

                time_taken_list = np.array(time_taken_list)
                time_mean = np.mean(time_taken_list)
                time_std = np.std(time_taken_list)

                model_result = {
                    'train_data': data.name,
                    'compute_loss': loss_flag,
                    'sg': sg_val,
                    'hs': hs_val,
                    'train_time_mean': time_mean,
                    'train_time_std': time_std,
                }
                print("Word2vec model #%i: %s" % (len(train_time_values), model_result))
                train_time_values.append(model_result)

train_times_table = pd.DataFrame(train_time_values)
train_times_table = train_times_table.sort_values(
    by=['train_data', 'sg', 'hs', 'compute_loss'],
    ascending=[False, False, True, False],
)
print(train_times_table)

Word2vec model #0: {'train_data': '25kB', 'compute_loss': True, 'sg': 0, 'hs': 0, 'train_time_mean': 0.5842417081197103, 'train_time_std': 0.023061585599197555}
Word2vec model #1: {'train_data': '25kB', 'compute_loss': False, 'sg': 0, 'hs': 0, 'train_time_mean': 0.5968351364135742, 'train_time_std': 0.013687334346923204}
Word2vec model #2: {'train_data': '25kB', 'compute_loss': True, 'sg': 0, 'hs': 1, 'train_time_mean': 0.8884803454081217, 'train_time_std': 0.08638033373526041}
Word2vec model #3: {'train_data': '25kB', 'compute_loss': False, 'sg': 0, 'hs': 1, 'train_time_mean': 0.7722438176472982, 'train_time_std': 0.025149038378122435}
Word2vec model #4: {'train_data': '25kB', 'compute_loss': True, 'sg': 1, 'hs': 0, 'train_time_mean': 0.8475955327351888, 'train_time_std': 0.054728935366151456}
Word2vec model #5: {'train_data': '25kB', 'compute_loss': False, 'sg': 1, 'hs': 0, 'train_time_mean': 0.862225612004598, 'train_time_std': 0.04703722128200267}
Word2vec model #6: {'train_data': 

In [23]:
most_similars_precalc = {word : model.wv.most_similar(word) for word in model.wv.index2word}
for i, (key, value) in enumerate(most_similars_precalc.items()):
    if i == 3:
        break
    print(key, value)

the [('by', 0.9999262094497681), ('being', 0.9999250769615173), ('an', 0.9999247789382935), ('on', 0.9999246597290039), ('and', 0.9999232292175293), ('with', 0.9999204277992249), ('at', 0.9999191761016846), ('of', 0.9999165534973145), ('sydney', 0.9999163150787354), ('in', 0.9999150037765503)]
to [('but', 0.9999604225158691), ('is', 0.9999570846557617), ('into', 0.9999527931213379), ('are', 0.9999514818191528), ('for', 0.9999501705169678), ('over', 0.9999499320983887), ('them', 0.9999480843544006), ('from', 0.9999476671218872), ('out', 0.9999474287033081), ('with', 0.9999464750289917)]
of [('with', 0.999951958656311), ('in', 0.999948263168335), ('and', 0.999948263168335), ('by', 0.9999479055404663), ('three', 0.9999462366104126), ('at', 0.9999454021453857), ('for', 0.999944806098938), ('from', 0.9999441504478455), ('after', 0.9999438524246216), ('on', 0.9999422430992126)]


### Comparison with and without caching

In [24]:
words = ['voted', 'few', 'their', 'around']

#Without caching 
start = time.time()
for word in words:
    result = model.wv.most_similar(word)
    print(result)
end = time.time()
print(end - start)

[('take', 0.9984893202781677), ('overnight', 0.998449981212616), ('tora', 0.9984446167945862), ('israeli', 0.9984394311904907), ('had', 0.9984369277954102), ('shane', 0.9984357953071594), ('place', 0.9984205365180969), ('hour', 0.9984184503555298), ('warne', 0.9984147548675537), ('into', 0.9984145760536194)]
[('on', 0.9997923970222473), ('into', 0.9997806549072266), ('in', 0.9997788667678833), ('military', 0.9997775554656982), ('two', 0.9997754693031311), ('were', 0.9997754096984863), ('and', 0.9997727870941162), ('from', 0.999770998954773), ('an', 0.9997692108154297), ('before', 0.9997677803039551)]
[('before', 0.9999513030052185), ('out', 0.9999500513076782), ('about', 0.9999487400054932), ('when', 0.9999478459358215), ('as', 0.9999445676803589), ('which', 0.9999439716339111), ('would', 0.999943733215332), ('into', 0.9999432563781738), ('with', 0.9999430179595947), ('on', 0.9999410510063171)]
[('from', 0.999925971031189), ('has', 0.9999244213104248), ('with', 0.9999223947525024), ('t

In [25]:
#With caching
start = time.time()
for word in words:
    if word in most_similars_precalc:
        result = most_similars_precalc[word]
        print(result)
    else:
        result = model.wv.most_similar(word)
        most_similars_precalc[word] = result
        print(result)

end = time.time()
print(end - start)

[('take', 0.9984893202781677), ('overnight', 0.998449981212616), ('tora', 0.9984446167945862), ('israeli', 0.9984394311904907), ('had', 0.9984369277954102), ('shane', 0.9984357953071594), ('place', 0.9984205365180969), ('hour', 0.9984184503555298), ('warne', 0.9984147548675537), ('into', 0.9984145760536194)]
[('on', 0.9997923970222473), ('into', 0.9997806549072266), ('in', 0.9997788667678833), ('military', 0.9997775554656982), ('two', 0.9997754693031311), ('were', 0.9997754096984863), ('and', 0.9997727870941162), ('from', 0.999770998954773), ('an', 0.9997692108154297), ('before', 0.9997677803039551)]
[('before', 0.9999513030052185), ('out', 0.9999500513076782), ('about', 0.9999487400054932), ('when', 0.9999478459358215), ('as', 0.9999445676803589), ('which', 0.9999439716339111), ('would', 0.999943733215332), ('into', 0.9999432563781738), ('with', 0.9999430179595947), ('on', 0.9999410510063171)]
[('from', 0.999925971031189), ('has', 0.9999244213104248), ('with', 0.9999223947525024), ('t
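
The dictionary-based cache above can also be expressed with the standard library's functools.lru_cache, which memoizes a function by its arguments. In this sketch `cached_lookup` is a hypothetical stand-in for an expensive call such as model.wv.most_similar(word); it returns a dummy value so the example runs without a trained model.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def cached_lookup(word):
    """Stand-in for an expensive query; the body runs once per distinct word."""
    # Dummy computation in place of a real similarity search.
    return tuple(sorted(word))


for word in ["voted", "few", "voted", "few"]:
    cached_lookup(word)

info = cached_lookup.cache_info()
print(info.hits, info.misses)
```

Returning a tuple rather than a list is a deliberate choice: the cached object is shared between callers, so an immutable result avoids accidental modification of the cache.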

### Visualising the Word Embeddings

In [26]:
from sklearn.decomposition import IncrementalPCA    # initial reduction
from sklearn.manifold import TSNE                   # final reduction
import numpy as np                                  # array handling


def reduce_dimensions(model):
    num_dimensions = 2  # final num dimensions (2D, 3D, etc)

    vectors = [] # positions in vector space
    labels = [] # keep track of words to label our data again later
    for word in model.wv.vocab:
        vectors.append(model.wv[word])
        labels.append(word)

    # convert both lists into numpy vectors for reduction
    vectors = np.asarray(vectors)
    labels = np.asarray(labels)

    # reduce using t-SNE (vectors is already a NumPy array at this point)
    tsne = TSNE(n_components=num_dimensions, random_state=0)
    vectors = tsne.fit_transform(vectors)

    x_vals = [v[0] for v in vectors]
    y_vals = [v[1] for v in vectors]
    return x_vals, y_vals, labels


x_vals, y_vals, labels = reduce_dimensions(model)

def plot_with_plotly(x_vals, y_vals, labels, plot_in_notebook=True):
    from plotly.offline import init_notebook_mode, iplot, plot
    import plotly.graph_objs as go

    trace = go.Scatter(x=x_vals, y=y_vals, mode='text', text=labels)
    data = [trace]

    if plot_in_notebook:
        init_notebook_mode(connected=True)
        iplot(data, filename='word-embedding-plot')
    else:
        plot(data, filename='word-embedding-plot.html')


def plot_with_matplotlib(x_vals, y_vals, labels):
    import matplotlib.pyplot as plt
    import random

    random.seed(0)

    plt.figure(figsize=(12, 12))
    plt.scatter(x_vals, y_vals)

    #
    # Label randomly subsampled 25 data points
    #
    indices = list(range(len(labels)))
    selected_indices = random.sample(indices, 25)
    for i in selected_indices:
        plt.annotate(labels[i], (x_vals[i], y_vals[i]))

try:
    get_ipython()
except Exception:
    plot_function = plot_with_matplotlib
else:
    plot_function = plot_with_plotly

plot_function(x_vals, y_vals, labels)