Author: Kaveh Mahdavi <kavehmahdavi74@yahoo.com>
License: BSD 3 clause
last update: 17/01/2023

# Word Embedding Representation

To represent text as a tensor, I previously created high-dimensional bag-of-words vectors of length `vocab_size`.
In doing so, I explicitly converted low-dimensional positional representation vectors into a sparse one-hot representation, which has several drawbacks:
1. It is not memory-efficient.
2. Each word is treated independently of the others.
3. One-hot encoded vectors don't express semantic similarities between words.

Instead, we use word embeddings. Learning them requires a lot of data, both in total volume and in repeated occurrences
of individual words, as well as a long training time. The result is a dense vector with a fixed, arbitrary number of dimensions.

They also differ at the prediction stage:
* One-Hot Encoding tells you nothing of the semantics of the items; each vectorization is an orthogonal representation in another dimension.
* Embeddings will group commonly co-occurring items together in the representation space.
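
The drawbacks of one-hot encoding listed above can be checked on a tiny example. This is only a sketch: the four-word vocabulary and the sentence are made up for illustration, and the last line simply compares the per-token cost of a one-hot vector with that of a 120-dimensional embedding.

```python
import numpy as np

vocab = ['cat', 'dog', 'car', 'bus']         # hypothetical toy vocabulary
word_to_index = {w: i for i, w in enumerate(vocab)}

sentence = ['dog', 'car', 'dog']
indices = np.array([word_to_index[w] for w in sentence])

# One-hot: one row per token, one column per vocabulary entry.
one_hot = np.eye(len(vocab), dtype=np.float32)[indices]
print(one_hot.shape)                  # (3, 4): the width grows with the vocabulary

# Dot products between distinct one-hot vectors are always zero,
# so the encoding carries no similarity signal.
gram = np.eye(len(vocab)) @ np.eye(len(vocab)).T
off_diagonal = gram[~np.eye(len(vocab), dtype=bool)]
print(off_diagonal.max())             # 0.0

# With vocab_size = 10000, one token costs 10000 numbers as one-hot,
# but only embed_size numbers (e.g. 120) as a dense embedding.
print(10000 // 120)                   # 83
```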

In [2]:
import sys
import os

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
import gensim.downloader as api
import numpy as np
from gensim.models import FastText

In [3]:
# To use GPU memory cautiously, I set tensorflow option to grow GPU memory allocation when needed.
physical_devices = tf.config.list_physical_devices('GPU')
if len(physical_devices) > 0:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)

### Load Dataset

I continue exploring the News AG dataset. I load the data and get some definitions from the previous unit.

In [10]:
ds_train, ds_test = tfds.load('ag_news_subset').values()

print("Size of train dataset: {}".format(len(ds_train)))
print("Size of test dataset:  {}".format(len(ds_test)))

Size of train dataset: 120000
Size of test dataset:  7600


### Embedding

The idea of embedding is to represent words using lower-dimensional dense vectors that reflect the semantic meaning of the word. An embedding layer takes a word as input and produces an output vector of the specified `embedding_size`.
In a sense, it is very similar to a `Dense` layer, but instead of taking a one-hot encoded vector as input, it's able to take a word number.

As a result, the classifier neural network consists of the following layers:

* `TextVectorization` layer, which takes a string as input, and produces a tensor of token numbers. We will specify some reasonable vocabulary size `vocab_size`, and ignore less-frequently used words. The input shape will be 1, and the output shape will be $n$, since we'll get $n$ tokens as a result, each of them containing a number from 0 to `vocab_size - 1`.
* `Embedding` layer, which takes $n$ numbers, and maps each number to a dense vector of a given length (120 in our
example). Thus, the input tensor of shape $n$ will be transformed into an $n\times 120$ tensor.
* Aggregation layer, which takes the average of this tensor along the token axis (axis 1 once the batch dimension is included), i.e. it will compute the average of all $n$ vectors corresponding to different words. To implement this layer, we will use a `Lambda` layer, and pass into it the function to compute the average. The output will have shape of 120, and it will be the numeric representation of the whole input sequence.
* Final `Dense` linear classifier.
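
What the `Embedding` and aggregation layers do to a single sentence can be sketched in plain NumPy. The sizes below are tiny stand-ins for `vocab_size = 10000` and an embedding length of 120, and the weights are random rather than trained, so this only illustrates the shapes and the equivalence with a one-hot matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_size = 10, 4          # tiny stand-ins for 10000 and 120

# The embedding layer is a (vocab_size x embed_size) weight matrix.
W = rng.normal(size=(vocab_size, embed_size))

tokens = np.array([3, 7, 7, 1])         # a "sentence" of n = 4 token numbers

# Lookup by row index ...
looked_up = W[tokens]                   # shape (n, embed_size)

# ... gives the same result as multiplying one-hot vectors by W,
# without ever building the one-hot vectors.
one_hot = np.eye(vocab_size)[tokens]    # shape (n, vocab_size)
via_matmul = one_hot @ W

# Aggregation: average over the token axis to get one vector per sentence.
sentence_vector = looked_up.mean(axis=0)

print(looked_up.shape, sentence_vector.shape)   # (4, 4) (4,)
print(np.allclose(looked_up, via_matmul))       # True
```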

In [11]:
vocab_size = 10000
batch_size = 128

vectorizer = keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size, input_shape=(1,))

model = keras.models.Sequential([
    vectorizer,
    keras.layers.Embedding(vocab_size, 120),
    keras.layers.Lambda(lambda x: tf.reduce_mean(x, axis=1)),
    keras.layers.Dense(4, activation='softmax')
])
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_1 (TextV  (None, None)             0         
 ectorization)                                                   
                                                                 
 embedding_1 (Embedding)     (None, None, 120)         1200000   
                                                                 
 lambda_1 (Lambda)           (None, 120)               0         
                                                                 
 dense_1 (Dense)             (None, 4)                 484       
                                                                 
Total params: 1,200,484
Trainable params: 1,200,484
Non-trainable params: 0
_________________________________________________________________


In [12]:
def to_tuple(_x):
    return _x['title'] + ' ' + _x['description'], _x['label']

**Note:** In the output shape column of the model summary above:
* The first tensor dimension `None` corresponds to the minibatch size.
* The second corresponds to the length of the token sequence.
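
The `Param #` column of the summary can be checked by hand. The short calculation below assumes the same sizes as in the model definition (10000 tokens, 120 embedding dimensions, 4 classes).

```python
vocab_size, embed_size, num_classes = 10000, 120, 4

embedding_params = vocab_size * embed_size             # one row of weights per token
dense_params = embed_size * num_classes + num_classes  # weights plus one bias per class

print(embedding_params)                 # 1200000
print(dense_params)                     # 484
print(embedding_params + dense_params)  # 1200484
```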

In [13]:
ds_train_embed = ds_train.map(to_tuple).batch(batch_size)
ds_test_embed = ds_test.map(to_tuple).batch(batch_size)

print("Training vectorizer")
model.layers[0].adapt(ds_train.take(500).map(lambda x: x['title'] + ' ' + x['description']))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.fit(ds_train_embed, validation_data=ds_test_embed)

Training vectorizer
Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089






<keras.callbacks.History at 0x7f6f2afc9550>

**Note:** Token sequences in a minibatch generally have different lengths. The next section discusses how to deal with this.

### Uniform Sequence Sizes

If I apply the `TextVectorization` layer to a single input, the number of tokens returned varies with the input text,
depending on how it is tokenized:

In [13]:
print(vectorizer('Hello friend'))
print(vectorizer('What are you doing with AI'))

tf.Tensor([1 1], shape=(2,), dtype=int64)
tf.Tensor([ 333   37  158 4499   11    1], shape=(6,), dtype=int64)


When the `vectorizer` is applied to several sequences at once, it has to produce a tensor of rectangular shape, so it fills unused
elements with the PAD token (zero in the output below). The embedding layer can then be applied to the padded tensor:

In [17]:
vectorizer(['Hello friend', 'What are you doing with AI'])

<tf.Tensor: shape=(2, 6), dtype=int64, numpy=
array([[   1,    1,    0,    0,    0,    0],
       [ 333,   37,  158, 4499,   11,    1]])>
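
The padding step can be sketched in plain Python. `pad_batch` is a hypothetical helper written for this illustration; it is not how Keras implements padding internally, it only reproduces the rectangular result shown above.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad lists of token numbers to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]


batch = [[1, 1], [333, 37, 158, 4499, 11, 1]]   # token numbers from the output above
padded = pad_batch(batch)
for row in padded:
    print(row)
# [1, 1, 0, 0, 0, 0]
# [333, 37, 158, 4499, 11, 1]
```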

In [18]:
model.layers[1](vectorizer(['Hello friend', 'What are you doing with AI'])).numpy()

array([[[-0.03438212, -0.05019719,  0.01805012, ...,  0.06051063,
          0.08801311,  0.03039854],
        [-0.03438212, -0.05019719,  0.01805012, ...,  0.06051063,
          0.08801311,  0.03039854],
        [ 0.01830896,  0.00465834,  0.02408492, ..., -0.04267353,
          0.03699594, -0.0320233 ],
        [ 0.01830896,  0.00465834,  0.02408492, ..., -0.04267353,
          0.03699594, -0.0320233 ],
        [ 0.01830896,  0.00465834,  0.02408492, ..., -0.04267353,
          0.03699594, -0.0320233 ],
        [ 0.01830896,  0.00465834,  0.02408492, ..., -0.04267353,
          0.03699594, -0.0320233 ]],

       [[-0.25439277,  0.17742102, -0.20666021, ...,  0.23273583,
          0.2606135 , -0.15447445],
        [ 0.05197402,  0.05718319, -0.14672998, ...,  0.06464478,
          0.00392407, -0.15935227],
        [-0.10426484,  0.21083392, -0.2192003 , ...,  0.13561691,
          0.14087412, -0.29651752],
        [ 0.1050437 ,  0.02400611, -0.06014836, ..., -0.0729761 ,
         -0.08

## Semantic Embeddings

### Word2Vec

In our previous example, the learned representations did not necessarily have semantic meaning. Ideally, similar words
or synonyms should correspond to vectors that are close to each other in terms of some vector distance (for example, Euclidean distance).

To achieve this, I need to pretrain the embedding model on a large text collection using the [Word2Vec](https://en.wikipedia.org/wiki/Word2vec) technique.
There are two approaches to producing a distributed representation of words:

 - **Continuous bag-of-words** (CBoW), where I train the model to predict a word from the surrounding context. Given
 the ngram $(W_{-2},W_{-1},W_0,W_1,W_2)$, the model predicts $W_0$ from $(W_{-2},W_{-1},W_1,W_2)$.
 - **Continuous skip-gram** is the opposite of CBoW. The model uses the input word ($W_0$) to predict the surrounding window of context words.

CBoW is faster, and while skip-gram is slower, it does a better job of representing infrequent words.
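
The difference between the two objectives is easiest to see in the training examples they generate. The sketch below uses a hypothetical helper `training_pairs` with a fixed window of 2; real Word2Vec implementations add refinements such as subsampling and negative sampling that are left out here.

```python
def training_pairs(tokens, window=2):
    """Return (cbow, skipgram) training examples for a list of tokens.

    cbow:     list of (context_words, target_word)
    skipgram: list of (input_word, context_word)
    """
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        cbow.append((context, target))                  # context -> W_0
        skipgram.extend((target, c) for c in context)   # W_0 -> each context word
    return cbow, skipgram


sentence = 'i like to learn math'.split()
cbow, skipgram = training_pairs(sentence)

print(cbow[2])        # (['i', 'like', 'learn', 'math'], 'to')
print(skipgram[:2])   # [('i', 'like'), ('i', 'to')]
```

CBoW produces one example per position, while skip-gram produces one per (word, context word) pair, which is one reason skip-gram takes longer to train.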

![Image showing both CBoW and Skip-Gram algorithms to convert words to vectors.](../img/converting-words-to-vectors.png)

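To experiment with pretrained embeddings, we can use the **gensim** library.
Let's download the `glove-wiki-gigaword-50` vectors and load them as a `KeyedVectors` object. Note that these are GloVe vectors rather than vectors trained with Word2Vec, but gensim exposes them through the same interface.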

In [5]:
_word2vec = api.load('glove-wiki-gigaword-50')

Let's first look at the first ten entries of the vocabulary.

In [6]:
for index, word in enumerate(_word2vec.index_to_key):
    if index == 10:
        break
    print(f"word #{index}/{len(_word2vec.index_to_key)} is {word}")

word #0/400000 is the
word #1/400000 is ,
word #2/400000 is .
word #3/400000 is of
word #4/400000 is to
word #5/400000 is and
word #6/400000 is in
word #7/400000 is a
word #8/400000 is "
word #9/400000 is 's


In [130]:
pairs = [
    ('car', 'minivan'),  # a minivan is a kind of car
    ('car', 'bicycle'),  # still a wheeled vehicle
    ('car', 'airplane'),  # ok, no wheels, but still a vehicle
    ('car', 'cereal'),  # ... and so on
    ('car', 'communism'),
]
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, _word2vec.similarity(w1, w2)))

'car'	'minivan'	0.72
'car'	'bicycle'	0.76
'car'	'airplane'	0.74
'car'	'cereal'	0.18
'car'	'communism'	0.07
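
The `similarity` scores above are cosine similarities between the two word vectors. A minimal sketch of that measure follows; the 3-dimensional vectors are made up purely to exercise the function and are not real embeddings.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


v_car = [1.0, 2.0, 0.0]
v_van = [2.0, 4.0, 0.0]      # same direction as v_car, twice the length
v_other = [0.0, 0.0, 3.0]    # orthogonal to v_car

print(cosine_similarity(v_car, v_van))     # 1.0 up to floating point rounding
print(cosine_similarity(v_car, v_other))   # 0.0
```

Because only the direction matters, vector length (which often correlates with word frequency) does not affect the score.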


Now let's find the words most similar to 'math':

In [33]:
for w, p in _word2vec.most_similar('math'):
    print(f"{w} -> {p}")

maths -> 0.7655045390129089
curriculum -> 0.754166841506958
graders -> 0.7464368939399719
instruction -> 0.7285575270652771
grades -> 0.7256329655647278
undergraduate -> 0.712658166885376
mathematics -> 0.7076627612113953
exams -> 0.6997538805007935
teaching -> 0.6977996826171875
courses -> 0.6964027285575867


For classification modeling, I can use the embedding vector extracted for each word. Each embedding has 50
components.

In [37]:
print(_word2vec['love'])

[-0.13886    1.1401    -0.85212   -0.29212    0.75534    0.82762
 -0.3181     0.0072204 -0.34762    1.0731    -0.24665    0.97765
 -0.55835   -0.090318   0.83182   -0.33317    0.22648    0.30913
  0.026929  -0.086739  -0.14703    1.3543     0.53695    0.43735
  1.2749    -1.4382    -1.2815    -0.15196    1.0506    -0.93644
  2.7561     0.58967   -0.29473    0.27574   -0.32928   -0.201
 -0.28547   -0.45987   -0.14603   -0.69372    0.070761  -0.19326
 -0.1855    -0.16095    0.24268    0.20784    0.030924  -1.3711
 -0.28606    0.2898   ]


With semantic embeddings, I can manipulate the vector encoding based on semantics. E.g., I can look for a word
whose vector representation is as close as possible to the words 'earth' and 'moon', and as far as possible from the
word 'son':

In [71]:
_word2vec.most_similar(positive=['earth', 'moon'], negative=['son'])[0]

('mars', 0.7735902070999146)

In [133]:
_word2vec.doesnt_match('fire water land sea air car'.split())

'car'

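The same kind of query can be written as explicit vector operations:
1. Calculate the vector corresponding to EARTH-SON+MOON (operations + and - are performed on the vector representations of the corresponding words).
2. Find the closest word in the dictionary to that vector.

Note that the manual version below uses squared Euclidean distance on the raw vectors and does not exclude the input words, so it can return one of the inputs, whereas `most_similar` ranks by cosine similarity and skips the words given in the query.

In [78]:
vec = _word2vec['earth'] - _word2vec['son'] + _word2vec['moon']
# squared Euclidean distance from `vec` to every embedding vector
d = np.sum((_word2vec.vectors - vec) ** 2, axis=1)
min_idx = np.argmin(d)  # index of the closest embedding vector
_word2vec.index_to_key[min_idx]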

'earth'
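
The gap between this result and the earlier `most_similar` output can be reproduced on toy data. The 2-D vectors below are made up so that the effect is visible, and the "cosine version" is only my approximation of what `most_similar` does (unit-length vectors, cosine ranking, query words excluded), not gensim's actual code.

```python
import numpy as np

# Made-up 2-D vectors for illustration; not taken from a trained model.
toy = {
    'earth':    np.array([1.0, 0.2]),
    'moon':     np.array([0.9, 0.4]),
    'son':      np.array([0.1, 1.0]),
    'mars':     np.array([0.8, -0.1]),
    'daughter': np.array([0.0, 0.9]),
}
words = list(toy)
matrix = np.stack([toy[w] for w in words])


def unit(v):
    """Scale a vector (or each row of a matrix) to unit length."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


# Naive version: raw vectors, squared Euclidean distance, inputs not excluded.
query = toy['earth'] - toy['son'] + toy['moon']
naive = words[int(np.argmin(np.sum((matrix - query) ** 2, axis=1)))]

# Cosine version: unit-length vectors, query words removed from the candidates.
inputs = {'earth', 'moon', 'son'}
direction = unit(unit(toy['earth']) + unit(toy['moon']) - unit(toy['son']))
scores = unit(matrix) @ direction
candidates = [(s, w) for s, w in zip(scores, words) if w not in inputs]
best = max(candidates)[1]

print(naive)   # earth: an input word sits closest to the query
print(best)    # mars
```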

Word2Vec has many disadvantages, such as:
* Both CBoW and skip-gram models are **predictive embeddings**, and they only take local context into account. Word2Vec does not take advantage of global context.
* Word2Vec does not take into account word **morphology**, i.e. the fact that the meaning of the word can depend on different parts of the word, such as the root.

The `FastText` model addresses the second issue.

### FastText

FastText builds on Word2Vec by learning vector representations for each word and for the character n-grams found within each word.
The values of the representations are then averaged into one vector at each training step.
While this adds a lot of additional computation to pretraining, it enables word embeddings to encode sub-word information.
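
To make the idea of character n-grams concrete, here is a small sketch that extracts them in the style described in the FastText paper, with `<` and `>` marking word boundaries. `char_ngrams` is a hypothetical helper; the range 3 to 6 matches what I believe are gensim's default `min_n` and `max_n`, and the real implementation additionally hashes n-grams into a fixed number of buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f'<{word}>'
    grams = set()
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.add(wrapped[start:start + n])
    return grams


read, reads = char_ngrams('read'), char_ngrams('reads')
shared = read & reads

print(sorted(char_ngrams('read', min_n=3, max_n=3)))   # ['<re', 'ad>', 'ead', 'rea']
print(len(shared), len(read), len(reads))              # 6 10 14
```

Since 'read' and 'reads' share most of their n-grams, a vector for 'reads' can be composed even when the word itself never appeared in training.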

In [92]:
from gensim.test.utils import datapath

# Path to the training corpus bundled with gensim's test data
corpus_file = datapath('lee_background.cor')

model = FastText(vector_size=100)

# build the vocabulary
model.build_vocab(corpus_file=corpus_file)

# train the model
model.train(corpus_file=corpus_file,
            epochs=model.epochs,
            total_examples=model.corpus_count,
            total_words=model.corpus_total_words, )

print(model)

FastText<vocab=1762, vector_size=100, alpha=0.025>


In [104]:
# Word vector lookup
wv = model.wv
print('read' in wv.key_to_index)
print('reads' in wv.key_to_index)
print(wv['read'])

True
False
[-0.24528746  0.27193153 -0.34032896 -0.12382222  0.06654335  0.45437434
  0.27720225  0.53884196  0.30880558 -0.23876895  0.02806783 -0.19655971
 -0.24001622  0.5301068  -0.45379153 -0.6253118   0.24313356 -0.30311784
 -0.4984473  -0.6274031  -0.56004435 -0.05712337 -0.44443405 -0.12025888
 -0.24422058 -0.41321412 -0.8320234  -0.13387157 -0.415995    0.34590375
 -0.3994049   0.34339675  0.94730055 -0.29650086  0.22428054  0.50993854
  0.42656848 -0.14222454 -0.46551844 -0.4307183   0.48203936 -0.46428716
  0.02117989 -0.48691463 -0.5408621  -0.31460562 -0.10765355  0.13655096
  0.48128426  0.00842677  0.3880696  -0.46757853  0.3134806  -0.4678918
 -0.20688201 -0.21745946 -0.15032953 -0.16580129  0.0539568  -0.40048352
 -0.36010033 -0.510874   -0.17036255  0.39725602 -0.1923789   0.77905256
  0.09166505  0.08640496  0.50800806  0.21683708 -0.28926727  0.41293505
  0.54126114 -0.7425832   0.41529313 -0.12393712  0.31253526  0.00776238
  0.04701365  0.4486477   0.21676937 -0.5

In [105]:
# Similarity operations
print(wv.similarity("read", "reads"))

0.9999865


Syntactically similar words generally have high similarity in fastText models, since a large number of the component char-ngrams will be the same. As a result, fastText generally does better at syntactic tasks than Word2Vec.

#### Other similarity operations

In [107]:
wv.most_similar("read")

[('reached', 0.9999895095825195),
 ('real', 0.9999886751174927),
 ('hearing', 0.9999877214431763),
 ('ready', 0.9999877214431763),
 ('starting', 0.9999864101409912),
 ('acting', 0.9999861717224121),
 ('really', 0.9999860525131226),
 ('working', 0.9999859929084778),
 ('threatened', 0.9999859929084778),
 ('playing', 0.9999856948852539)]

In [108]:
wv.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant'])

0.99994016

In [122]:
wv.doesnt_match("breakfast cereal dinner lunch".split())

'lunch'

In [123]:
wv.most_similar(positive=['baghdad', 'england'], negative=['london'])

[('capital,', 0.9996482729911804),
 ('find', 0.9996472001075745),
 ('field', 0.9996403455734253),
 ('findings', 0.9996394515037537),
 ('seekers.', 0.9996380805969238),
 ('finding', 0.9996365308761597),
 ('abuse', 0.9996364116668701),
 ('had', 0.9996358156204224),
 ('storm', 0.9996349215507507),
 ('26-year-old', 0.999632716178894)]

In [124]:
wv.evaluate_word_analogies(datapath('questions-words.txt'))

(0.25510204081632654,
 [{'section': 'capital-common-countries', 'correct': [], 'incorrect': []},
  {'section': 'capital-world', 'correct': [], 'incorrect': []},
  {'section': 'currency', 'correct': [], 'incorrect': []},
  {'section': 'city-in-state', 'correct': [], 'incorrect': []},
  {'section': 'family',
   'correct': [],
   'incorrect': [('HE', 'SHE', 'HIS', 'HER'), ('HIS', 'HER', 'HE', 'SHE')]},
  {'section': 'gram1-adjective-to-adverb', 'correct': [], 'incorrect': []},
  {'section': 'gram2-opposite', 'correct': [], 'incorrect': []},
  {'section': 'gram3-comparative',
   'correct': [('LONG', 'LONGER', 'GREAT', 'GREATER')],
   'incorrect': [('GOOD', 'BETTER', 'GREAT', 'GREATER'),
    ('GOOD', 'BETTER', 'LONG', 'LONGER'),
    ('GOOD', 'BETTER', 'LOW', 'LOWER'),
    ('GREAT', 'GREATER', 'LONG', 'LONGER'),
    ('GREAT', 'GREATER', 'LOW', 'LOWER'),
    ('GREAT', 'GREATER', 'GOOD', 'BETTER'),
    ('LONG', 'LONGER', 'LOW', 'LOWER'),
    ('LONG', 'LONGER', 'GOOD', 'BETTER'),
    ('LOW', 'L

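#### Word Mover's Distance
Word Mover's Distance measures how far the words of one document have to travel in embedding space to match the words of another document.
Let's start with two sentences: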


In [125]:
sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()

In [126]:
# Remove their stopwords.
from gensim.parsing.preprocessing import STOPWORDS

sentence_obama = [w for w in sentence_obama if w not in STOPWORDS]
sentence_president = [w for w in sentence_president if w not in STOPWORDS]

In [127]:
# Compute the Word Movers Distance between the two sentences.
distance = wv.wmdistance(sentence_obama, sentence_president)
print(f"Word Movers Distance is {distance} (lower means closer)")

Word Movers Distance is 0.015816832268444662 (lower means closer)
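
To give some intuition for the number above, the sketch below computes a simplified, "relaxed" variant in which every word of the first document moves entirely to its nearest word in the second. This is only a lower bound on the true distance: `wmdistance` solves the full transport problem, which this sketch does not. The helper `relaxed_wmd` and the 2-D vectors are made up for illustration.

```python
import numpy as np


def relaxed_wmd(doc_a, doc_b, vectors):
    """Lower bound on WMD: each word of doc_a moves to its nearest word in doc_b."""
    weight = 1.0 / len(doc_a)                     # uniform bag-of-words weights
    total = 0.0
    for word in doc_a:
        dists = [np.linalg.norm(vectors[word] - vectors[other]) for other in doc_b]
        total += weight * min(dists)
    return total


# Made-up 2-D vectors; related words are placed close together on purpose.
toy_vectors = {
    'obama':     np.array([1.0, 0.0]),
    'president': np.array([0.9, 0.1]),
    'media':     np.array([0.0, 1.0]),
    'press':     np.array([0.1, 0.9]),
    'banana':    np.array([5.0, 5.0]),
}

near = relaxed_wmd(['obama', 'media'], ['president', 'press'], toy_vectors)
far = relaxed_wmd(['obama', 'media'], ['banana'], toy_vectors)
print(near < far)   # True: the related sentence is closer
```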


#### Visualising Word Embeddings

In [138]:
from sklearn.manifold import TSNE  # final reduction
import umap.umap_ as umap
import numpy as np  # array handling
from plotly.offline import init_notebook_mode, iplot, plot
import plotly.graph_objs as go


def reduce_dimensions(model, _method='umap'):
    num_dimensions = 2  # final num dimensions (2D, 3D, etc)

    # extract the words & their vectors, as numpy arrays
    vectors = np.asarray(model.wv.vectors)
    labels = np.asarray(model.wv.index_to_key)  # fixed-width numpy strings

    # reduce
    if _method == 'tsne':
        _reduce = TSNE(n_components=num_dimensions, random_state=0)
    else:
        _reduce = umap.UMAP()

    vectors = _reduce.fit_transform(vectors)

    x_vals = [v[0] for v in vectors]
    y_vals = [v[1] for v in vectors]
    return x_vals, y_vals, labels


x_vals, y_vals, labels = reduce_dimensions(model)


def plot_with_plotly(x_vals, y_vals, labels, plot_in_notebook=True):
    trace = go.Scatter(x=x_vals, y=y_vals, mode='text', text=labels)
    data = [trace]

    if plot_in_notebook:
        init_notebook_mode(connected=True)
        iplot(data, filename='word-embedding-plot')
    else:
        plot(data, filename='word-embedding-plot.html')


def plot_with_matplotlib(x_vals, y_vals, labels):
    import matplotlib.pyplot as plt
    import random

    random.seed(0)

    plt.figure(figsize=(12, 12))
    plt.scatter(x_vals, y_vals)

    #
    # Label randomly subsampled 25 data points
    #
    indices = list(range(len(labels)))
    selected_indices = random.sample(indices, 25)
    for i in selected_indices:
        plt.annotate(labels[i], (x_vals[i], y_vals[i]))


try:
    get_ipython()
except Exception:
    plot_function = plot_with_matplotlib
else:
    plot_function = plot_with_plotly

plot_function(x_vals, y_vals, labels)

### Pretrained Embeddings Layers

We can modify the example above to prepopulate the matrix in our embedding layer with semantic embeddings, such as Word2Vec.
The vocabularies of the pretrained embedding and the text corpus will likely not match, so we need to choose one.
There are two possible options:
* tokenizer vocabulary
* vocabulary from Word2Vec embeddings

#### Tokenizer Vocabulary

In this case, some words from the vocabulary will have corresponding Word2Vec embeddings, and some will be missing.
Given that the vocabulary size is `vocab_size` and the Word2Vec embedding vector length is `embed_size`, the embedding
layer is represented by a weight matrix of shape `vocab_size`$\times$`embed_size`.
We will populate this matrix by going through the vocabulary:

In [20]:
embed_size = len(_word2vec.get_vector('math'))
print(f'Embedding size: {embed_size}')

vocab = vectorizer.get_vocabulary()
W = np.zeros((vocab_size, embed_size))
print('Populating matrix, this will take some time...', end='')
found, not_found = 0, 0
for i, w in enumerate(vocab):
    try:
        W[i] = _word2vec.get_vector(w)
        found += 1
    except KeyError:  # the word has no pretrained vector
        not_found += 1
print(f"Done, found {found} words, {not_found} words missing")

Embedding size: 50
Populating matrix, this will take some time...Done, found 4961 words, 374 words missing


For words that are not present in the Word2Vec vocabulary, I can either leave them as zeroes, or generate a random
vector.
Then I can define an embedding layer with pretrained weights:

In [18]:
# Passing trainable=False freezes the pretrained weights, so the Embedding layer is not retrained.
# This may cause accuracy to be slightly lower, but it speeds up the training.
emb_layer = keras.layers.Embedding(vocab_size, embed_size, weights=[W], trainable=False)
model = keras.models.Sequential([vectorizer,
                                 emb_layer,
                                 keras.layers.Lambda(lambda x: tf.reduce_mean(x, axis=1)),
                                 keras.layers.Dense(4, activation='softmax')])

In [19]:
ds_train_embed = ds_train.map(to_tuple).batch(batch_size)
ds_test_embed = ds_test.map(to_tuple).batch(batch_size)

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.fit(ds_train_embed, validation_data=ds_test_embed)



<keras.callbacks.History at 0x7f6f457002e0>
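
The matrix built above leaves missing words as zero rows. The alternative mentioned earlier, filling them with random vectors, could look roughly like the sketch below. All names with a `toy_` prefix are hypothetical stand-ins for the tokenizer vocabulary and the pretrained vectors; I keep the padding row (index 0) at zero, which is a design choice of this sketch rather than a requirement.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the tokenizer vocabulary and the pretrained vectors.
toy_vocab = ['', '[UNK]', 'math', 'qwertyish']
toy_pretrained = {'math': np.array([0.5, -0.5, 0.1])}
toy_embed_size = 3

W_toy = np.zeros((len(toy_vocab), toy_embed_size))
missing = []
for i, word in enumerate(toy_vocab):
    if word in toy_pretrained:
        W_toy[i] = toy_pretrained[word]
    elif i > 0:
        # Row 0 is the padding token and stays zero;
        # other unknown words get small random values.
        W_toy[i] = rng.normal(scale=0.1, size=toy_embed_size)
        missing.append(word)

print(missing)   # ['[UNK]', 'qwertyish']
```

Random rows only make sense if the embedding layer is trainable, or if a fixed random vector is an acceptable representation for those words.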

#### Embedding Vocabulary

One issue with the previous approach is that the vocabularies used in the TextVectorization and Embedding are different.
To overcome this problem, we can use one of the following solutions:
* Re-train the Word2Vec model on our vocabulary.
* Load our dataset with the vocabulary from the pretrained Word2Vec model. Vocabularies used to load the dataset can be specified during loading.

The latter approach seems easier, so let's implement it. First of all, we will create a `TextVectorization` layer with the specified vocabulary, taken from the Word2Vec embeddings:

In [21]:
# Gensim 4.0.0 removed the `vocab` attribute from KeyedVectors (accessing it raises AttributeError);
# `index_to_key` is the replacement list of tokens.
vocab = list(_word2vec.index_to_key)
vectorizer = keras.layers.experimental.preprocessing.TextVectorization(input_shape=(1,))
vectorizer.set_vocabulary(vocab)
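
One detail to watch when building the embedding matrix for this approach: as far as I know, `TextVectorization` in its default integer mode reserves index 0 for padding and index 1 for out-of-vocabulary tokens, so the tokens passed to `set_vocabulary` start at index 2. The sketch below shows the resulting row offset on made-up data; `pretrained_tokens` and `pretrained_vectors` are hypothetical stand-ins for `index_to_key` and `vectors`.

```python
import numpy as np

# Hypothetical stand-in for the pretrained model: token list plus matching vectors.
pretrained_tokens = ['the', 'of', 'math']
pretrained_vectors = np.array([
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
])

reserved = ['', '[UNK]']                     # padding and out-of-vocabulary slots (assumed)
layer_vocab = reserved + pretrained_tokens   # what get_vocabulary() is expected to return

# Shift every pretrained row down by the number of reserved slots.
W_aligned = np.zeros((len(layer_vocab), pretrained_vectors.shape[1]))
W_aligned[len(reserved):] = pretrained_vectors

token_id = layer_vocab.index('math')
print(token_id)               # 4
print(W_aligned[token_id])    # [0.5 0.6]
```

The `Embedding` layer then needs `len(layer_vocab)` rows rather than the size of the pretrained vocabulary.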