### Look up embeddings from pre-trained networks
Our final strategy is to look up embeddings from pre-trained networks. The simplest way to do this
with the current examples **is to just set the _trainable_ parameter of the embedding layer to False**. This
ensures that backpropagation will not update the weights on the embedding layer:

        model.add(Embedding(vocab_sz, EMBED_SIZE, input_length=maxlen,
                            weights=[embedding_weights],
                            trainable=False))
        
        model.add(SpatialDropout1D(0.2))

However, this is not how pre-trained embeddings are generally used in practice. The typical workflow
has two steps:
- preprocess your dataset to create word vectors by looking up each word in one of the pre-trained
models
- use this data to train some other model.

The second model would not contain an Embedding layer, and may not even be a deep learning network.
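
The two steps can be sketched on a toy scale. The snippet below is only an illustration: the 3-dimensional vectors in `toy_vectors` are invented stand-ins for a real pre-trained table, and the nearest-centroid classifier is an assumed example of a second model that is not a deep learning network.

```python
import numpy as np

# Hypothetical 3-dimensional "pre-trained" vectors (values invented for the demo).
toy_vectors = {
    "good": np.array([1.0, 0.2, 0.0]),
    "great": np.array([0.9, 0.1, 0.1]),
    "bad": np.array([-1.0, 0.1, 0.0]),
    "awful": np.array([-0.8, 0.3, 0.1]),
    "movie": np.array([0.0, 1.0, 0.0]),
}

def sentence_vector(sentence, vectors, dim=3):
    """Step 1: look up each word and sum the vectors (unknown words add zero)."""
    total = np.zeros(dim)
    for word in sentence.lower().split():
        total += vectors.get(word, np.zeros(dim))
    return total

train_sentences = ["good movie", "great movie", "bad movie", "awful movie"]
train_labels = np.array([1, 1, 0, 0])
X_train = np.stack([sentence_vector(s, toy_vectors) for s in train_sentences])

# Step 2: train "some other model"; here, one mean vector (centroid) per class.
centroids = {label: X_train[train_labels == label].mean(axis=0) for label in (0, 1)}

def predict(sentence):
    """Return the label whose centroid is closest to the sentence vector."""
    x = sentence_vector(sentence, toy_vectors)
    return min(centroids, key=lambda label: np.linalg.norm(x - centroids[label]))

print(predict("great good movie"), predict("awful bad film"))
```

Note that the second model never sees word IDs or an Embedding layer; it only receives fixed-size sentence vectors.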

The following example uses a dense network that takes as its input a vector of size 100,
representing a sentence, and outputs a 1 or 0 for positive or negative sentiment. Our dataset is still the
one from the UMICH SI650 sentiment classification competition, with around 7,000 sentences.

**We begin with the imports**

In [1]:
# from gensim.models import KeyedVectors                                       #for word2vec
from keras.layers.core import Dense, SpatialDropout1D, Dropout        
#from keras.layers.convolutional import Conv1D
from keras.layers.embeddings import Embedding
#from keras.layers.pooling import GlobalMaxPooling1D                          #for word2vec
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.utils import np_utils
from sklearn.model_selection import train_test_split
import collections
#import matplotlib.pyplot as plt
import nltk
import numpy as np     
import codecs

from tensorflow.keras.callbacks import TensorBoard
from time import gmtime, strftime
import datetime, os
import tensorflow as tf
import time

Using TensorFlow backend.


**Set the random seed for repeatability**

In [2]:
np.random.seed(42)

**Create a log folder for TensorBoard to save the graphs**

In [20]:
NAME = "lookupglove{}".format(int(time.time()))
tensorboard = TensorBoard(log_dir='logs/{}'.format(NAME))

**Set some constant values**

To create the 100-dimensional vector for each sentence, we add up the **GloVe 100-dimensional
vectors** of the words in the sentence, so we use the glove.6B.100d.txt file
(https://www.kaggle.com/terenceliu4444/glove6b100dtxt).

In [4]:
INPUT_FILE = "data/umich-sentiment-train.txt"
GLOVE_MODEL = "data/glove.6B.100d.txt"
VOCAB_SIZE = 5000
EMBED_SIZE = 100
BATCH_SIZE = 64
NUM_EPOCHS = 10

The next block reads the sentences and creates a word frequency table. From this, the most common
5000 tokens are selected and lookup tables (from word to word index and back) are created. In
addition, we create a pseudo-token _UNK_ for tokens that do not exist in the vocabulary. Using these
lookup tables, we convert each sentence to a sequence of word IDs, padding these sequences so that
all sequences are of the same length (the maximum number of words in a sentence in the training set).
We also convert the labels to categorical format:

**Reading data**: reads the sentences and creates a word frequency table

In [5]:
counter = collections.Counter()
fin = codecs.open(INPUT_FILE, "r", encoding='utf-8')
maxlen = 0
for line in fin:
    _, sent = line.strip().split("\t")
    words = [x.lower() for x in nltk.word_tokenize(sent)]   # lower case of words
    # Track the longest sentence in the training set; every sentence is
    # later padded to this predetermined length, maxlen.
    if len(words) > maxlen:
        maxlen = len(words)
    for word in words:
        counter[word] += 1
fin.close()

**creating vocabulary**: the most common
5000 tokens are selected and lookup tables (from word to word index and back) are created. In
addition, we create a pseudo-token _UNK_ for tokens that do not exist in the vocabulary.

In [6]:
word2index = collections.defaultdict(int)
for wid, word in enumerate(counter.most_common(VOCAB_SIZE)):
    word2index[word[0]] = wid + 1
vocab_sz = len(word2index) + 1
index2word = {v: k for k, v in word2index.items()}
index2word[0] = "_UNK_"
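
To make the lookup tables concrete, here is a small sketch of the same construction on three invented sentences, with a hypothetical `TOY_VOCAB_SIZE` of 2. It shows that any word outside the most common tokens falls back to ID 0, which decodes to `_UNK_`.

```python
import collections

toy_sentences = ["the movie was great", "the movie was bad", "the plot was great"]
TOY_VOCAB_SIZE = 2  # hypothetical, tiny on purpose

counter = collections.Counter()
for sent in toy_sentences:
    counter.update(sent.split())

# IDs start at 1; ID 0 is what defaultdict(int) returns for unseen words.
word2index = collections.defaultdict(int)
for wid, (word, _count) in enumerate(counter.most_common(TOY_VOCAB_SIZE)):
    word2index[word] = wid + 1
index2word = {v: k for k, v in word2index.items()}
index2word[0] = "_UNK_"

# "ending" never occurred and "great" did not make the top 2, so both map to 0.
ids = [word2index[w] for w in "the ending was great".split()]
decoded = [index2word[i] for i in ids]
print(ids, decoded)
```

One detail worth knowing: looking up a missing key in a `defaultdict` also inserts it, which is why the reverse table is built before any sentence is converted.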

**creating word sequences**:  Using these
lookup tables, we convert each sentence to a sequence of word IDs, padding these sequences so that
all sequences are of the same length (the maximum number of words in a sentence in the training set).
We also convert the labels to categorical format

In [7]:
ws, ys = [], []
fin = codecs.open(INPUT_FILE, "r", encoding='utf-8')
for line in fin:
    label, sent = line.strip().split("\t")
    ys.append(int(label))
    words = [x.lower() for x in nltk.word_tokenize(sent)]
    wids = [word2index[word] for word in words]
    ws.append(wids)
fin.close()
W = pad_sequences(ws, maxlen=maxlen)
Y = np_utils.to_categorical(ys)
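
For readers unfamiliar with the two helper calls, the following pure-Python sketch mimics what they produce under their default settings (zeros added at the front of short sequences, one-hot rows for labels). The functions `pad_pre` and `one_hot` are hypothetical stand-ins written for illustration, not the Keras implementations.

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad (and left-truncate) each list of IDs to exactly maxlen items."""
    padded = []
    for seq in seqs:
        seq = seq[-maxlen:]  # keep at most the last maxlen IDs
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

def one_hot(labels, num_classes):
    """Turn integer labels into rows with a single 1.0 at the label's position."""
    return [[1.0 if c == y else 0.0 for c in range(num_classes)] for y in labels]

W_toy = pad_pre([[4, 2], [7, 1, 3, 5]], maxlen=4)
Y_toy = one_hot([1, 0], num_classes=2)
print(W_toy)  # [[0, 0, 4, 2], [7, 1, 3, 5]]
print(Y_toy)  # [[0.0, 1.0], [1.0, 0.0]]
```

Because the padding value is 0, padded positions share their ID with `_UNK_`.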

**Load the GloVe vectors into a dictionary**

In [8]:
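# Missing words fall back to the int 0, which NumPy broadcasts to a zero
# vector when it is assigned into the embedding matrix later on.
word2emb = collections.defaultdict(int)
fglove = open(GLOVE_MODEL, "rb")
for line in fglove:
    cols = line.strip().split()
    word = cols[0].decode('utf-8')                    # first column is the token
    embedding = np.array(cols[1:], dtype="float32")   # remaining columns are the vector
    word2emb[word] = embedding
fglove.close()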
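
The loading loop relies on the GloVe text layout, in which each line holds a token followed by its space-separated vector components. The sketch below parses two invented lines (3 components instead of 100) from an in-memory buffer; `fake_glove` and `lookup` are hypothetical names, and the plain `dict` with an explicit zero-vector fallback is an assumed alternative to the `defaultdict` used above.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout (values invented for the demo).
fake_glove = io.StringIO("the 0.1 -0.2 0.3\nmovie 0.5 0.0 -0.4\n")

toy_word2emb = {}
for line in fake_glove:
    cols = line.rstrip().split(" ")
    toy_word2emb[cols[0]] = np.array([float(c) for c in cols[1:]], dtype="float32")

def lookup(word, dim=3):
    """Return the stored vector, or zeros for an out-of-vocabulary word."""
    return toy_word2emb.get(word, np.zeros(dim, dtype="float32"))

print(lookup("movie"), lookup("zzz"))
```

Using `.get` keeps the dictionary from growing when an unknown word is requested.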

"""
# for word2vec
word2index = collections.defaultdict(int)
for wid, word in enumerate(counter.most_common(VOCAB_SIZE)):
    word2index[word[0]] = wid + 1
vocab_sz = len(word2index) + 1
index2word = {v: k for k, v in word2index.items()}
"""


**Transferring Embeddings**:
The next block looks up the words for each sentence from the word ID matrix _W_ and populates a
matrix _E_ with the corresponding embedding vectors, one column per word. These columns are then summed to create
a sentence vector, which is written into the corresponding row of the _X_ matrix. The output of this code block is the matrix _X_
of size (_num_records_, _EMBED_SIZE_):

In [14]:
X = np.zeros((W.shape[0], EMBED_SIZE))
for i in range(W.shape[0]):
    E = np.zeros((EMBED_SIZE, maxlen))
    words = [index2word[wid] for wid in W[i].tolist()]
    for j in range(maxlen):
        E[:, j] = word2emb[words[j]]    
    X[i, :] = np.sum(E, axis=1)
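
The same sentence vectors can also be computed without the inner loop. The sketch below is an assumed alternative, shown on toy data with 4-dimensional vectors: it first builds a table whose row `i` holds the vector for word ID `i`, then uses NumPy integer indexing and a sum over the word axis. All `toy_` names and values are invented for the illustration.

```python
import numpy as np

EMBED_DIM = 4  # stand-in for EMBED_SIZE = 100
toy_index2word = {0: "_UNK_", 1: "good", 2: "movie"}
toy_word2emb = {"good": np.array([1.0, 0.0, 2.0, 0.0]),
                "movie": np.array([0.0, 3.0, 0.0, 1.0])}

# Row i holds the vector for word ID i; row 0 (_UNK_ and padding) stays zero.
table = np.zeros((len(toy_index2word), EMBED_DIM))
for wid, word in toy_index2word.items():
    if word in toy_word2emb:
        table[wid] = toy_word2emb[word]

W_toy = np.array([[0, 1, 2],   # padding, "good", "movie"
                  [0, 0, 1]])  # padding, padding, "good"

# table[W_toy] has shape (2, 3, 4); summing over axis 1 adds up the words.
X_fast = table[W_toy].sum(axis=1)

# Reference: the explicit double loop gives the same result.
X_loop = np.zeros((W_toy.shape[0], EMBED_DIM))
for i in range(W_toy.shape[0]):
    for j in range(W_toy.shape[1]):
        X_loop[i] += table[W_toy[i, j]]

print(X_fast)
```

Since ID 0 maps to a zero row, padded positions leave the sentence vector unchanged.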

"""
# for word2vec
xs, ys = [], []
fin = codecs.open(INPUT_FILE, "r", encoding='utf-8')
for line in fin:
    label, sent = line.strip().split("\t")
    ys.append(int(label))
    words = [x.lower() for x in nltk.word_tokenize(sent)]
    wids = [word2index[word] for word in words]
    xs.append(wids)
fin.close()
X = pad_sequences(xs, maxlen=maxlen)
Y = np_utils.to_categorical(ys)
"""


**Split the data 70/30**: We have now preprocessed our data using the pre-trained model and are ready to use it to train and evaluate our final model. Let us split the data into 70/30 training/test as usual.

In [15]:
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.3, random_state=42)
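
As a quick sanity check on the split sizes, the arithmetic below reproduces the sample counts that appear in the training log further down. It assumes that the test share is rounded up to a whole number of samples and that the remainder goes to training; the total of 7,086 is simply the sum of the two counts in that log.

```python
import math

n_samples = 4960 + 2126  # train + validation counts from the training log
test_size = 0.3

n_test = math.ceil(test_size * n_samples)  # test share, rounded up
n_train = n_samples - n_test               # everything else is training data
print(n_train, n_test)
```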

**Neural Network Model:** The network we will train for doing the sentiment analysis task is a simple dense network. We compile it with a categorical cross-entropy loss function and the Adam optimizer, and train it with the
sentence vectors that we built out of the pre-trained embeddings. Finally, we evaluate the model on
the 30% test set.

In [16]:
model = Sequential()
model.add(Dense(32, input_dim=EMBED_SIZE, activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(2, activation="softmax"))
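
To see what this small network computes, the NumPy sketch below runs the same two dense layers by hand on random inputs. The weights are randomly initialised purely to show the shapes, so the probabilities are meaningless; dropout is left out because it is only active during training.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM, HIDDEN, CLASSES = 100, 32, 2

# Randomly initialised parameters, only to show the shapes involved.
W1, b1 = rng.normal(size=(EMBED_DIM, HIDDEN)) * 0.1, np.zeros(HIDDEN)
W2, b2 = rng.normal(size=(HIDDEN, CLASSES)) * 0.1, np.zeros(CLASSES)

def forward(x):
    """Dense(32, relu) followed by Dense(2, softmax)."""
    h = np.maximum(0.0, x @ W1 + b1)
    logits = h @ W2 + b2
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    return exp / exp.sum(axis=1, keepdims=True)

probs = forward(rng.normal(size=(5, EMBED_DIM)))  # 5 fake sentence vectors
n_params = W1.size + b1.size + W2.size + b2.size  # 3200 + 32 + 64 + 2
print(probs.shape, n_params)
```

The parameter count, 3,298, is tiny compared with a model that also has to learn an embedding table.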

**Compile the Model**

In [17]:
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
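
For intuition about the loss being minimised, here is a minimal pure-Python sketch of categorical cross-entropy on one-hot labels. The function is written for illustration only, and the `eps` clipping is an assumption added to avoid taking the log of zero.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum_k y_true[k] * log(y_pred[k])."""
    total = 0.0
    for t_row, p_row in zip(y_true, y_pred):
        total += -sum(t * math.log(max(p, eps)) for t, p in zip(t_row, p_row))
    return total / len(y_true)

y_true = [[0.0, 1.0], [1.0, 0.0]]
confident = categorical_crossentropy(y_true, [[0.1, 0.9], [0.8, 0.2]])
unsure = categorical_crossentropy(y_true, [[0.5, 0.5], [0.5, 0.5]])
print(confident, unsure)
```

Predictions that put more probability on the correct class yield a lower loss than a 50/50 guess, whose loss equals log 2.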

**Train and Evaluate the Model**

In [21]:
history = model.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE,
                    epochs=NUM_EPOCHS,
                    callbacks=[tensorboard],
                    validation_data=(Xtest, Ytest))

# evaluate model
score = model.evaluate(Xtest, Ytest, verbose=1)
print("Test score: {:.3f}, accuracy: {:.3f}".format(score[0], score[1]))

Train on 4960 samples, validate on 2126 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test score: 0.088, accuracy: 0.972


**Validation and Accuracy Plots**
<img src="LookupPreTainGlove1.JPG">

**Structure of the Neural Network Model**
<img src="LookupPreTainGlove2.JPG">