### Fine-tuning learned embeddings from word2vec

We use the same network that we used to learn our embeddings from
scratch. The only major code difference is an extra block that loads the word2vec
model and builds the weight matrix for the embedding layer.

In [1]:
from gensim.models import KeyedVectors
from keras.layers.core import Dense, SpatialDropout1D
from keras.layers.convolutional import Conv1D
from keras.layers.embeddings import Embedding
from keras.layers.pooling import GlobalMaxPooling1D
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.utils import np_utils
from sklearn.model_selection import train_test_split
import codecs
import collections
import nltk
import numpy as np
import os
import time

import tensorflow as tf
from tensorflow.keras.callbacks import TensorBoard

Using TensorFlow backend.


Set up a TensorBoard callback that saves the logs and graphs for this run in its own timestamped folder under `logs/`.

In [2]:
NAME = "word2vecEmb{}".format(int(time.time()))
tensorboard = TensorBoard(log_dir='logs/{}'.format(NAME))

Next up is setting up the constants. The only difference here is that we reduced the NUM_EPOCHS setting
from 20 to 10. **Recall that** initializing the embedding matrix with values from a pre-trained model tends to
start the weights at good values, so training converges faster.

**Download file** : https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing

In [3]:
INPUT_FILE = "data/umich-sentiment-train.txt"
WORD2VEC_MODEL = "data/GoogleNews-vectors-negative300.bin.gz"                    # this file is about 1.5 GB
VOCAB_SIZE = 5000
EMBED_SIZE = 300
NUM_FILTERS = 256
NUM_WORDS = 3
BATCH_SIZE = 64
NUM_EPOCHS = 10

Reads the dataset, counts how often each word occurs, and records the length of the longest sentence. The vocabulary of the most frequent terms is built from these counts.

In [4]:
counter = collections.Counter()
fin = codecs.open(INPUT_FILE, "r", encoding='utf-8')
maxlen = 0
for line in fin:
    _, sent = line.strip().split("\t")
    words = [x.lower() for x in nltk.word_tokenize(sent)]   # lowercase every token
    # Track the longest sentence. Every sentence is later padded to this
    # length (maxlen), the number of words in the longest training sentence.
    if len(words) > maxlen:
        maxlen = len(words)
    for word in words:
        counter[word] += 1
fin.close()
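
The counting step can be tried on a few lines without the dataset or NLTK. The following is a minimal sketch, assuming a hypothetical three-line corpus in the same `label<TAB>sentence` layout and using `str.split` as a stand-in for `nltk.word_tokenize`:

```python
import collections

# Hypothetical toy corpus in the same "label<TAB>sentence" layout.
lines = [
    "1\tI love this movie",
    "0\tI hate this movie",
    "1\tthis movie is great",
]

counter = collections.Counter()
maxlen = 0
for line in lines:
    _, sent = line.strip().split("\t")
    words = [w.lower() for w in sent.split()]  # stand-in for nltk.word_tokenize
    maxlen = max(maxlen, len(words))           # longest sentence seen so far
    counter.update(words)                      # add 1 for every token

# most_common(n) keeps only the n most frequent terms, which is how
# VOCAB_SIZE caps the vocabulary in the notebook.
top_two = counter.most_common(2)
print(maxlen, top_two)
```

Words that fall outside the `most_common` cut never receive an index of their own, so they are later treated as unknown.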

Builds the word-to-index and index-to-word lookup tables from the most frequent terms.

In [5]:
word2index = collections.defaultdict(int)
for wid, word in enumerate(counter.most_common(VOCAB_SIZE)):
    word2index[word[0]] = wid + 1
# Indices start at 1. Index 0 is reserved for the pseudo-token UNK,
# which represents words that are not in the vocabulary (hence the + 1 below).
vocab_sz = len(word2index) + 1
index2word = {v: k for k, v in word2index.items()}

Parses the dataset again to create a list of padded word lists. It also converts the labels to categorical format.

In [6]:
xs, ys = [], []
fin = codecs.open(INPUT_FILE, "r", encoding='utf-8')
for line in fin:
    label, sent = line.strip().split("\t")
    ys.append(int(label))
    words = [x.lower() for x in nltk.word_tokenize(sent)]
    wids = [word2index.get(word, 0) for word in words]   # .get() avoids inserting unknown words into the defaultdict
    xs.append(wids)
fin.close()
X = pad_sequences(xs, maxlen=maxlen)
Y = np_utils.to_categorical(ys)
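
To make the lookup, padding, and label encoding concrete, here is a NumPy-only sketch of what these steps produce. The tiny vocabulary and sentences are hypothetical. The padding loop mimics the default `padding="pre"` behaviour of `pad_sequences`, and `np.eye` stands in for `to_categorical`; it is an illustration, not the Keras implementation.

```python
import numpy as np

# Hypothetical tiny vocabulary; index 0 is left free for padding / unknown words.
word2index = {"this": 1, "movie": 2, "great": 3}

sentences = [["this", "movie"], ["great", "movie", "this", "rocks"]]
labels = [0, 1]

# Unknown words ("rocks") map to index 0.
xs = [[word2index.get(w, 0) for w in sent] for sent in sentences]

# Left-pad with zeros up to the longest sentence.
maxlen = max(len(x) for x in xs)
X = np.zeros((len(xs), maxlen), dtype="int32")
for row, wids in enumerate(xs):
    X[row, maxlen - len(wids):] = wids

# One-hot encode the integer labels: row i of the identity matrix encodes class i.
Y = np.eye(2, dtype="float32")[labels]

print(X)
print(Y)
```

Note that padding and unknown words share index 0 in this scheme, so the network cannot tell them apart.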

Finally, the data is split into a training set (70%) and a test set (30%).

In [7]:
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.3, random_state=42)
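
The split sizes can be reasoned about without scikit-learn. The sketch below is a simplified stand-in for `train_test_split`, not its actual algorithm: it shuffles the indices with a fixed seed and rounds the test share up. The helper name `split_indices` is hypothetical, and the total of 7086 samples is taken from the sample counts reported in the training log further down (4960 + 2126).

```python
import math
import numpy as np

def split_indices(n_samples, test_size=0.3, seed=42):
    """Return (train_idx, test_idx) after a seeded shuffle (illustrative helper)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = math.ceil(test_size * n_samples)  # test share rounded up
    return order[n_test:], order[:n_test]

train_idx, test_idx = split_indices(7086)
print(len(train_idx), len(test_idx))
```

Fixing the seed, as `random_state=42` does in the notebook, keeps the split identical between runs so that results stay comparable.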

**The next block loads the pre-trained word2vec model.**

This model was trained on about 100 billion words of Google News articles and has a vocabulary size of 3 million.

The shape of the embedding_weights matrix is (*vocab_sz*, *EMBED_SIZE*). The vocab_sz is one more than the
number of unique terms kept in the vocabulary; the additional row (index 0) stands for the pseudo-token _UNK_,
which represents words that are not seen in the vocabulary. Vocabulary words that have no word2vec vector keep an all-zero row.

In [8]:
# load word2vec model
# it takes a long time
word2vec = KeyedVectors.load_word2vec_format(WORD2VEC_MODEL, binary=True)
embedding_weights = np.zeros((vocab_sz, EMBED_SIZE))
for word, index in word2index.items():
    try:
        embedding_weights[index, :] = word2vec[word]
    except KeyError:
        pass
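
The same loop can be exercised without the 1.5 GB download. This is a minimal sketch that assumes a plain dictionary (`pretrained`, a hypothetical stand-in for the loaded `KeyedVectors`) and a toy embedding size; it also records which words had no pre-trained vector, which the notebook silently skips.

```python
import numpy as np

EMBED_SIZE = 4  # toy size; the notebook uses 300

# Hypothetical stand-in for the loaded word2vec model.
pretrained = {
    "movie": np.full(EMBED_SIZE, 0.5),
    "great": np.full(EMBED_SIZE, -0.25),
}
word2index = {"this": 1, "movie": 2, "great": 3}
vocab_sz = len(word2index) + 1  # +1 for index 0 (padding / unknown)

embedding_weights = np.zeros((vocab_sz, EMBED_SIZE))
missing = []
for word, index in word2index.items():
    vector = pretrained.get(word)
    if vector is None:
        missing.append(word)               # row stays all zeros
    else:
        embedding_weights[index, :] = vector

coverage = 1.0 - len(missing) / len(word2index)
print("missing:", missing, "coverage: {:.2f}".format(coverage))
```

Checking the coverage is a cheap sanity test: if most vocabulary words are missing, the pre-trained initialization contributes very little.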

The difference in this block from our previous example is that we initialize
the weights of the embedding layer with the *embedding_weights* matrix we built in the previous block:

In [9]:
model = Sequential()

model.add(Embedding(vocab_sz, 
                    EMBED_SIZE,
                    input_length=maxlen,
                    weights=[embedding_weights]))
model.add(SpatialDropout1D(0.2))

model.add(Conv1D(filters=NUM_FILTERS,
                 kernel_size=NUM_WORDS,
                 activation="relu"))

model.add(GlobalMaxPooling1D())

model.add(Dense(2, activation="softmax"))
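
The size of each layer follows from the constants with a little arithmetic. The sketch below assumes hypothetical values for `vocab_sz` and `maxlen` (the notebook derives the real ones from the dataset) and relies on the `Conv1D` defaults of stride 1 and `"valid"` padding.

```python
# Hypothetical sizes for illustration only.
vocab_sz, maxlen = 2000, 40
EMBED_SIZE, NUM_FILTERS, NUM_WORDS = 300, 256, 3

# Embedding: one EMBED_SIZE-dimensional row per word index.
embedding_params = vocab_sz * EMBED_SIZE

# Conv1D: each filter spans NUM_WORDS consecutive word vectors, plus one bias per filter.
conv_out_len = maxlen - NUM_WORDS + 1
conv_params = NUM_WORDS * EMBED_SIZE * NUM_FILTERS + NUM_FILTERS

# GlobalMaxPooling1D keeps one value per filter and has no weights.
# Dense: NUM_FILTERS inputs to 2 classes, plus 2 biases.
dense_params = NUM_FILTERS * 2 + 2

total_params = embedding_params + conv_params + dense_params
print("conv output:", (conv_out_len, NUM_FILTERS))
print("parameters:", embedding_params, conv_params, dense_params, total_params)
```

Under these assumed sizes the embedding layer holds most of the parameters, which is why starting it from pre-trained values matters.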



We then compile the model with the categorical cross-entropy loss function and the Adam optimizer,
train the network with a batch size of 64 for 10 epochs, and evaluate the trained model on the test set.
In [10]:
model.compile(loss="categorical_crossentropy", 
              optimizer="adam",
              metrics=["accuracy"])

history = model.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE,
                    epochs=NUM_EPOCHS,
                    callbacks=[tensorboard],
                    validation_data=(Xtest, Ytest))

# evaluate model
score = model.evaluate(Xtest, Ytest, verbose=1)
print("Test score: {:.3f}, accuracy: {:.3f}".format(score[0], score[1]))

Train on 4960 samples, validate on 2126 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test score: 0.014, accuracy: 0.996
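
In the line above, the test score is the categorical cross-entropy loss (`score[0]`) and the accuracy is `score[1]`. As a rough illustration of how the two numbers are computed, here is a NumPy sketch on three made-up predictions. The helper functions are hypothetical and only approximate what Keras does internally.

```python
import numpy as np

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean cross-entropy between one-hot labels and predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))

def accuracy(y_true, y_prob):
    """Fraction of rows where the most probable class matches the label."""
    return float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_prob, axis=1)))

# Made-up one-hot labels and softmax outputs for three samples.
y_true = np.array([[1, 0], [0, 1], [0, 1]], dtype="float32")
y_prob = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]], dtype="float32")

loss = categorical_crossentropy(y_true, y_prob)
acc = accuracy(y_true, y_prob)
print("loss: {:.3f}, accuracy: {:.3f}".format(loss, acc))
```

A low loss together with a high accuracy means that the predictions are both correct and confident.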


**Validation and Accuracy Plots**
<img src="tensorembed1.jpg">

**Structure of the Neural Network Model**
<img src="tensorembed2.JPG">