## Documentation/Sources
* [https://radimrehurek.com/gensim/models/word2vec.html](https://radimrehurek.com/gensim/models/word2vec.html) for more information about how to use gensim word2vec in general
* [https://codekansas.github.io/blog/2016/gensim.html](https://codekansas.github.io/blog/2016/gensim.html) (_this blog post has since been removed_) for information about using gensim word2vec to create embedding layers for neural networks.
* [https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/](https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/) for information on sequence classification with keras
* [https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html) for using pre-trained embeddings with keras (though the syntax it uses for the model layers differs from that of most other tutorials).
* [https://keras.io/](https://keras.io/) Keras API documentation

In [124]:
from gensim.models import word2vec
import numpy as np
import keras
from keras.datasets import imdb
from keras.preprocessing import sequence
from keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, Dense, Flatten, ReLU, GlobalMaxPooling1D
from tensorflow.keras import layers

## Load Trained Word Vectors

Load the trained model file into memory

In [125]:
wv_model = word2vec.Word2Vec.load('1billion_word_vectors')

Since we do not need to continue training the model, we can save memory by keeping the parts we need (the word vectors themselves) and getting rid of the rest of the model.

In [126]:
wordvec = wv_model.wv
del wv_model

Let's see what one of these vectors actually looks like.

In [127]:
wordvec['textbook']

array([ 0.50756323, -2.8890731 ,  0.9743826 , -0.60089743, -0.23762947,
       -2.324566  , -0.64634913, -0.66476715, -2.3432739 ,  1.4446437 ,
       -0.15542823,  1.8248576 ,  1.1309539 , -0.21071543, -0.82512087,
       -0.2773584 , -0.1973424 , -0.5337731 ,  2.1143918 ,  1.0673765 ,
       -0.2341243 ,  1.5292411 ,  0.66977274,  1.1214821 , -0.57710004,
       -0.02504024,  0.6074397 ,  0.19416903, -1.1265849 , -0.6618393 ,
        1.7525213 ,  1.6232891 , -0.3886833 , -1.1867149 ,  0.45511633,
        1.4240934 , -0.87929034, -1.8920534 ,  2.6986032 , -0.5277589 ,
        2.1202435 ,  0.62670445,  1.0352231 ,  1.4998924 ,  2.5809426 ,
        0.74698585, -0.07757699, -0.67074645,  1.6887746 , -0.22081567,
        1.2107906 ,  0.16741815,  3.3496742 ,  1.1832954 ,  0.4423463 ,
        0.04771314, -0.14557275, -1.3345221 ,  1.3236852 ,  2.0154989 ,
       -0.6510446 ,  0.21808812, -0.31578887, -1.822629  ,  0.8436349 ,
       -1.1500564 ,  1.24044   , -2.6430037 ,  1.0617311 ,  1.20

## Use Word Vectors in an Embedding Layer of a Keras Model

In [128]:
# get_keras_embedding() was deprecated, so this helper is used instead. It is taken from
# https://github.com/RaRe-Technologies/gensim/wiki/Using-Gensim-Embeddings-with-Keras-and-Tensorflow

def gensim_to_keras_embedding(keyed_vectors, train_embeddings=False):
    weights = keyed_vectors.vectors  # vectors themselves, a 2D numpy array    
    index_to_key = keyed_vectors.index_to_key  # which row in `weights` corresponds to which word?

    layer = Embedding(
        input_dim=weights.shape[0],
        output_dim=weights.shape[1],
        weights=[weights],
        trainable=train_embeddings,
    )
    return layer

Older versions of gensim gave `wordvec` a built-in method, `get_keras_embedding()`, for converting the vectors into a Keras embedding layer. Because that method has been deprecated, the helper function above is used instead.

For this experiment we'll give the embedding layer just one word at a time, so we can set the input length to 1.

In [129]:
# test_embedding_layer = wordvec.get_keras_embedding()
test_embedding_layer = gensim_to_keras_embedding(wordvec)
test_embedding_layer.input_length = 1

In [130]:
embedding_model = Sequential()
embedding_model.add(test_embedding_layer)

But how do we actually use this? If you look at the [Keras Embedding Layer documentation](https://keras.io/layers/embeddings/) you might notice that it takes numerical input, not strings. How do we know which number corresponds to a particular word? In addition to having a vector, each word has an index:

In [131]:
wordvec.key_to_index['python']

30438

Let's see if we get the same vector from the embedding layer as we get from our word vector object.

In [132]:
wordvec['python']

array([-1.1750487e+00,  2.3066440e-04, -6.0706180e-01, -1.1156354e+00,
       -1.0580894e+00, -2.7154784e+00, -3.6140988e+00, -1.0810910e+00,
        1.1234255e+00, -7.7326834e-01, -1.3322397e+00,  9.2905626e-02,
       -2.4488842e+00, -1.7817341e-01, -3.5459950e+00, -1.7320968e+00,
        1.9397168e+00, -6.3734710e-01,  2.3254216e+00, -1.3535864e+00,
       -1.4451812e-01, -2.4297442e+00,  1.5498929e+00,  8.1969726e-01,
        9.0982294e-01, -6.6116208e-01,  3.8905215e-01,  3.3855909e-01,
       -7.5454485e-01, -1.0352553e+00, -2.5936973e+00,  1.2103225e+00,
       -3.0236175e+00,  3.0580134e+00, -3.9140179e+00,  4.0223894e-01,
        1.7356061e+00,  9.0976155e-01,  2.0956397e-02,  2.0190549e+00,
        4.5332021e-01, -1.6634842e+00, -4.8180079e-01,  2.0414692e-01,
       -5.9267312e-01, -1.4182589e+00, -9.7301149e-01,  5.1611459e-01,
        2.0727324e+00,  2.0064230e+00, -7.5027935e-02, -1.1723986e+00,
       -8.6943096e-01,  1.7028141e+00,  2.2190344e+00,  9.3605727e-01,
      

In [133]:
embedding_model.predict(np.array([[30438]]))

array([[[-1.1750487e+00,  2.3066440e-04, -6.0706180e-01, -1.1156354e+00,
         -1.0580894e+00, -2.7154784e+00, -3.6140988e+00, -1.0810910e+00,
          1.1234255e+00, -7.7326834e-01, -1.3322397e+00,  9.2905626e-02,
         -2.4488842e+00, -1.7817341e-01, -3.5459950e+00, -1.7320968e+00,
          1.9397168e+00, -6.3734710e-01,  2.3254216e+00, -1.3535864e+00,
         -1.4451812e-01, -2.4297442e+00,  1.5498929e+00,  8.1969726e-01,
          9.0982294e-01, -6.6116208e-01,  3.8905215e-01,  3.3855909e-01,
         -7.5454485e-01, -1.0352553e+00, -2.5936973e+00,  1.2103225e+00,
         -3.0236175e+00,  3.0580134e+00, -3.9140179e+00,  4.0223894e-01,
          1.7356061e+00,  9.0976155e-01,  2.0956397e-02,  2.0190549e+00,
          4.5332021e-01, -1.6634842e+00, -4.8180079e-01,  2.0414692e-01,
         -5.9267312e-01, -1.4182589e+00, -9.7301149e-01,  5.1611459e-01,
          2.0727324e+00,  2.0064230e+00, -7.5027935e-02, -1.1723986e+00,
         -8.6943096e-01,  1.7028141e+00,  2.2190344
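
The two outputs agree because an embedding layer is, in essence, a row lookup into its weight matrix. Here is a minimal NumPy sketch of that idea with a made-up three-word vocabulary. The names `toy_key_to_index`, `toy_weights` and `toy_embed` are illustrative only and are not part of gensim or Keras.

```python
import numpy as np

# Hypothetical toy vocabulary: each word maps to a row of the weight matrix.
toy_key_to_index = {"the": 0, "python": 1, "textbook": 2}
toy_weights = np.array([
    [0.1, 0.2],   # row 0 -> "the"
    [0.3, 0.4],   # row 1 -> "python"
    [0.5, 0.6],   # row 2 -> "textbook"
])

def toy_embed(indices):
    """Return one row of the weight matrix per index, like an embedding layer."""
    return toy_weights[np.asarray(indices)]

python_index = toy_key_to_index["python"]
# A batch of one sequence containing one word, like np.array([[30438]]) above.
looked_up = toy_embed([[python_index]])
```

The result has one extra pair of brackets compared with the plain vector, just as the `predict` output does, because the lookup keeps the batch and sequence dimensions.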

## IMDB Dataset
The [IMDB dataset](https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification) consists of movie reviews that have been marked as positive or negative. (There is also a built-in dataset of [Reuters newswires](https://keras.io/datasets/#reuters-newswire-topics-classification) that have been classified by topic.)

In [134]:
(x_train, y_train), (x_test, y_test) = imdb.load_data()



In [135]:
imdb_offset = 3
imdb_map = dict((index + imdb_offset, word) for (word, index) in imdb.get_word_index().items())
imdb_map[0] = 'PADDING'
imdb_map[1] = 'START'
imdb_map[2] = 'UNKNOWN'
imdb_map

{34704: 'fawn',
 52009: 'tsukino',
 52010: 'nunnery',
 16819: 'sonja',
 63954: 'vani',
 1411: 'woods',
 16118: 'spiders',
 2348: 'hanging',
 2292: 'woody',
 52011: 'trawling',
 52012: "hold's",
 11310: 'comically',
 40833: 'localized',
 30571: 'disobeying',
 52013: "'royale",
 40834: "harpo's",
 52014: 'canet',
 19316: 'aileen',
 52015: 'acurately',
 52016: "diplomat's",
 25245: 'rickman',
 6749: 'arranged',
 52017: 'rumbustious',
 52018: 'familiarness',
 52019: "spider'",
 68807: 'hahahah',
 52020: "wood'",
 40836: 'transvestism',
 34705: "hangin'",
 2341: 'bringing',
 40837: 'seamier',
 34706: 'wooded',
 52021: 'bravora',
 16820: 'grueling',
 1639: 'wooden',
 16821: 'wednesday',
 52022: "'prix",
 34707: 'altagracia',
 52023: 'circuitry',
 11588: 'crotch',
 57769: 'busybody',
 52024: "tart'n'tangy",
 14132: 'burgade',
 52026: 'thrace',
 11041: "tom's",
 52028: 'snuggles',
 29117: 'francesco',
 52030: 'complainers',
 52128: 'templarios',
 40838: '272',
 52031: '273',
 52133: 'zaniacs',
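
The offset of 3 exists because indexes 0, 1 and 2 are reserved for padding, the start marker and unknown words, so every real word is shifted up by three. The sketch below shows how such a map can turn an encoded review back into text. It uses a tiny hypothetical word index (`toy_word_index`) in place of `imdb.get_word_index()`, and `decode_review` is an illustrative helper, not a Keras function.

```python
# Hypothetical stand-in for imdb.get_word_index(): word -> integer rank.
toy_word_index = {"the": 1, "movie": 2, "great": 3}

offset = 3  # indexes 0, 1 and 2 are reserved for padding, start and unknown
toy_map = {index + offset: word for word, index in toy_word_index.items()}
toy_map[0] = "PADDING"
toy_map[1] = "START"
toy_map[2] = "UNKNOWN"

def decode_review(encoded):
    """Turn a list of integer codes back into a space-separated string."""
    return " ".join(toy_map.get(code, "UNKNOWN") for code in encoded)

decoded = decode_review([1, 4, 5, 6])
```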

## Train IMDB Word Vectors
The word vectors from the 1 billion words dataset might work for us when trying to classify the IMDB data. Word vectors trained on the IMDB data itself might work better, though.

In [136]:
train_sentences = [['PADDING'] + [imdb_map[word_index] for word_index in review] for review in x_train]
test_sentences = [['PADDING'] + [imdb_map[word_index] for word_index in review] for review in x_test]

In [137]:
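# min_count=1 puts any word that appears at least once into the vocabulary
# vector_size sets the dimension of the output vectors
# 'UNKNOWN' is wrapped in its own list so that it is treated as a one-word sentence
# rather than being iterated character by character
imdb_wv_model = word2vec.Word2Vec(train_sentences + test_sentences + [['UNKNOWN']], min_count=1, vector_size=100)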

In [138]:
imdb_wordvec = imdb_wv_model.wv
del imdb_wv_model

In [139]:
cutoff = 500
x_train_padded = sequence.pad_sequences(x_train, maxlen=cutoff)
x_test_padded = sequence.pad_sequences(x_test, maxlen=cutoff)
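
By default, `pad_sequences` both pads and truncates at the front of each sequence, so short reviews get leading zeros and long reviews keep only their last `maxlen` entries. A pure-Python sketch of that default behaviour follows. The helper `pad_pre` is a hypothetical re-implementation for illustration, not the Keras function, and it assumes `maxlen` is at least 1.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic 'pre' padding and 'pre' truncation for a single sequence."""
    trimmed = seq[-maxlen:]                      # keep only the last maxlen items
    padding = [value] * (maxlen - len(trimmed))  # fill the front with the pad value
    return padding + trimmed

short_review = pad_pre([5, 6], maxlen=4)
long_review = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)
```

With `cutoff = 500`, this means the model sees the end of a long review rather than its beginning.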

## Classification With Word Vectors Trained With Model

Model definition. The embedding layer here learns the 100-dimensional vector embedding within the overall classification problem training. That is usually what we want, unless we have a bunch of un-tagged data that could be used to train word vectors but not a classification model.

**Creates an embedding layer that is initialized randomly and then learns the word vectors during training.**

**Here the embedding layer is left at its default of `trainable=True`.**

In [140]:
not_pretrained_model = Sequential()
not_pretrained_model.add(Embedding(input_dim=len(imdb_map), output_dim=100, input_length=cutoff))
not_pretrained_model.add(Conv1D(filters=32, kernel_size=5, activation='relu'))
not_pretrained_model.add(Conv1D(filters=32, kernel_size=5, activation='relu'))
not_pretrained_model.add(Flatten())
not_pretrained_model.add(Dense(units=128, activation='relu'))
not_pretrained_model.add(Dense(units=1, activation='sigmoid')) # because at the end, we want one yes/no answer
not_pretrained_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['binary_accuracy'])
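
To get a feel for the size of this model, it helps to trace the sequence length through the layers. The sketch below assumes the `Conv1D` defaults of `'valid'` padding and a stride of 1. The helper `conv1d_valid_length` is illustrative and not part of Keras.

```python
def conv1d_valid_length(length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1

cutoff = 500
after_first = conv1d_valid_length(cutoff, kernel_size=5)        # first Conv1D
after_second = conv1d_valid_length(after_first, kernel_size=5)  # second Conv1D
flattened_units = after_second * 32      # 32 filters per remaining position
dense_weights = flattened_units * 128    # weights into Dense(128), biases excluded
```

Under these assumptions, most of the non-embedding weights sit between `Flatten` and the first `Dense` layer.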

Train and assess the model.

In [141]:
not_pretrained_model.fit(x_train_padded, y_train, epochs=1, batch_size=64)



<tensorflow.python.keras.callbacks.History at 0x1b0f70ee0>

In [154]:
not_pretrained_scores = not_pretrained_model.evaluate(x_test_padded, y_test)



## Classification With Pre-Trained Word Vectors
Using the details above about how the imdb dataset and the keras embedding layer represent words, define a model that uses the pre-trained word vectors from the imdb dataset rather than an embedding that keras learns as it goes along. You'll need to replace the embedding layer and feed in different training data.

Create a simple dict containing each word and an index

In [143]:
# Map each word to a consecutive integer, in the iteration order of imdb_map
word_index = dict(zip(imdb_map.values(), range(len(imdb_map))))
del word_index["'l'"] # this word throws the error 'key ''l'' not present' when creating embedding_matrix

Make a matrix with the pre-trained vectors for each index to be input to the Embedding layer

In [144]:
num_tokens = len(word_index) + 2
embedding_dim = 100

embedding_matrix = np.zeros((num_tokens, embedding_dim))

for word, i in word_index.items():
    embedding_matrix[i] = imdb_wordvec[word]

Create the Embedding layer with the pre-trained vectors  

In [145]:
embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False, # keep the embeddings fixed
    input_length=cutoff
)

Define the model

In [146]:
pretrained_model = Sequential()
pretrained_model.add(embedding_layer)
pretrained_model.add(Conv1D(filters=32, kernel_size=5, activation='relu'))
pretrained_model.add(Conv1D(filters=32, kernel_size=5, activation='relu'))
pretrained_model.add(Flatten())
pretrained_model.add(Dense(units=128, activation='relu', ))
pretrained_model.add(Dense(units=1, activation='sigmoid')) # because at the end, we want one yes/no answer
pretrained_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['binary_accuracy'])

Fit and assess the model

In [147]:
pretrained_model.fit(x_train_padded, y_train, epochs=1, batch_size=64)



<tensorflow.python.keras.callbacks.History at 0x1affd4250>

In [155]:
pretrained_scores = pretrained_model.evaluate(x_test_padded, y_test)



## Analysis

I used [this Keras article](https://keras.io/examples/nlp/pretrained_word_embeddings/) to help me create the pre-trained embedding layer.

Classifying the IMDB movie reviews as positive or negative using word vectors trained with the model worked pretty well. The loss was 0.26 and the accuracy 0.89. However, using pre-trained word vectors dropped the accuracy to 0.58 and raised the loss to 0.68.

If we guessed randomly between two classes, we would be right half of the time. Thus, the worst accuracy is effectively 0.5 (if the accuracy is below 0.5, the model's predictions are simply reversed). An accuracy of 0.58 therefore tells us that the pre-trained model is more or less guessing positive or negative.
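
The loss points the same way. Binary cross-entropy for a classifier that always outputs a probability of 0.5 is ln 2, roughly 0.693, which is close to the 0.68 reported above. A short check of that arithmetic follows. The helper `binary_crossentropy` is an illustrative single-example version, not the Keras loss function.

```python
import math

def binary_crossentropy(y_true, p):
    """Binary cross-entropy for one example with predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# A model that always outputs 0.5 pays the same loss for either label.
chance_loss_positive = binary_crossentropy(1, 0.5)
chance_loss_negative = binary_crossentropy(0, 0.5)
```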

We would think that pre-trained word vectors wouldn't be so different. After all, the meaning of words doesn't change, right? The vector space shouldn't be that different between the pre-trained and trained-with-model approaches.

However, [Keras embedding layers](https://keras.io/api/layers/core_layers/embedding/) use indexes to retrieve the vectors that correspond to particular words. When we fit the model with the pre-trained word vectors, the row indexes of the pre-trained embedding matrix are completely different from the indexes used in the IMDB `x_train` data. So when the model reads "pizza" (actually some number like 1632), it looks up row 1632 of the pre-trained matrix, which might hold the vector for "doorknob".
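
One way to keep the two indexings aligned is to place each word's vector in the row given by its IMDB index, so that the integer stored in `x_train` selects the matching vector. The following is a sketch under assumptions, not the code that produced the results above. `toy_map` and `toy_vectors` are hypothetical stand-ins for `imdb_map` and `imdb_wordvec`, and words without a trained vector are left as zero rows.

```python
import numpy as np

# Hypothetical stand-ins for imdb_map (index -> word) and imdb_wordvec (word -> vector).
toy_map = {0: "PADDING", 1: "START", 2: "UNKNOWN", 4: "pizza", 5: "doorknob"}
toy_vectors = {
    "PADDING": np.array([0.0, 0.0]),
    "START": np.array([1.0, 0.0]),
    "pizza": np.array([0.2, 0.9]),
    "doorknob": np.array([0.7, 0.1]),
}

embedding_dim = 2
num_rows = max(toy_map) + 1  # one row per possible index, including unused ones
aligned_matrix = np.zeros((num_rows, embedding_dim))

for index, word in toy_map.items():
    if word in toy_vectors:  # skip words that have no trained vector
        aligned_matrix[index] = toy_vectors[word]
```

With this layout, index 4 retrieves the vector for "pizza" and nothing else. The same membership test should also work on gensim word vectors (`word in imdb_wordvec`), which would avoid having to delete missing words by hand.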

## End-to-end Model

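The following is taken from the Keras tutorial and would need to be adapted before it could run in this notebook: `train_samples`, `model` and `class_names` are defined in the tutorial, not here.
```python
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization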

vectorizer = TextVectorization(max_tokens=20000, output_sequence_length=200)
text_ds = tf.data.Dataset.from_tensor_slices(train_samples).batch(128)
vectorizer.adapt(text_ds)

string_input = keras.Input(shape=(1,), dtype="string")
x = vectorizer(string_input)
preds = model(x)
end_to_end_model = keras.Model(string_input, preds)

probabilities = end_to_end_model.predict(
    [["this message is about computer graphics and 3D modeling"]]
)

class_names[np.argmax(probabilities[0])]
```