## Good old word vectors

Let's start off with something classic: word vectors, a learned representation of the meanings of words.

This is important because I want to show you how to go about representation learning in a couple of ways. The first way is to solve a prediction problem, which is the most common and often the most useful approach. We are in a situation where we want to solve a prediction problem and the features we have are qualitative and interact with one another, e.g. language. Another common example is recommending movies using collaborative filtering.

We don't have an easy or intuitive way to represent qualitative features, so we will learn the representations instead. As a byproduct, we get representations that are geared toward our prediction problem.

In [1]:
# we initialize some hyperparams
MAX_SEQUENCE_LENGTH = 1000
MAX_NB_WORDS = 20000
INDEX_FROM=3
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2

We will be trying to predict whether movie reviews are good or bad, so we will be using the IMDB dataset (it might take some time to download the reviews):

In [2]:
from keras.datasets import imdb

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=MAX_NB_WORDS, index_from=INDEX_FROM)

Using TensorFlow backend.


Notice that the data has already been converted to a form that we like: each word is now an index (we will talk about why this is so important later).

In [3]:
x_train[0][:5]

[1, 14, 22, 16, 43]

That being said, we can convert it back to text by using the word index dictionary. Notice that the first three entries in our dictionary are special tokens: padding, start of sequence, and unknown word (a word outside the vocabulary we kept, like a rare proper noun).

In [4]:
word_to_id = imdb.get_word_index()
word_to_id = {k:(v+INDEX_FROM) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2

id_to_word = {value:key for key,value in word_to_id.items()}
# use `i` rather than `id`, which would shadow the Python builtin
print(' '.join(id_to_word[i] for i in x_train[0]))

<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the praising list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be p

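To see why the offset matters, here is a toy sketch of the same mapping with a made-up three-word vocabulary. The names `toy_index` and `decode` are hypothetical and not part of Keras; the only thing borrowed from the notebook is the `INDEX_FROM = 3` offset.

```python
# Toy stand-in for imdb.get_word_index(): rank 1 is the most frequent word.
toy_index = {"the": 1, "film": 2, "brilliant": 3}
INDEX_FROM = 3  # same offset as in the notebook

# Shift every rank by INDEX_FROM, then claim the low ids for the special tokens.
word_to_id = {word: rank + INDEX_FROM for word, rank in toy_index.items()}
word_to_id.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2})
id_to_word = {i: word for word, i in word_to_id.items()}

def decode(ids):
    """Map a list of ids back to text; ids with no entry fall back to <UNK>."""
    return " ".join(id_to_word.get(i, "<UNK>") for i in ids)

decoded = decode([1, 4, 5, 2, 6])
print(decoded)
```

Since the ranks start at 1, the first real word lands on id 4, which leaves id 3 without any word attached to it.
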
We will not be using the typical deep learning tool for NLP, the LSTM, this time. Our network needs every review to be the same length, so we will pad the reviews that are too short and truncate the ones that are too long:

In [5]:
from keras.preprocessing.sequence import pad_sequences

x_train = pad_sequences(x_train, maxlen=MAX_SEQUENCE_LENGTH)
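
As a rough picture of what that call does, here is a pure-Python sketch of my understanding of the `pad_sequences` defaults: zeros are added at the front of short sequences, and long sequences lose their earliest ids. The helper `pad_pre` and the toy reviews are hypothetical, written only for illustration.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` ids of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

toy_reviews = [[1, 14, 22], [1, 14, 22, 16, 43, 530, 973]]
padded = pad_pre(toy_reviews, maxlen=5)
print(padded)
```

Every row now has the same length, which is what lets us stack the reviews into one rectangular array for the network.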

Our labels will just be 0 or 1 (bad or good):

In [6]:
y_train

array([1, 0, 0, ..., 0, 1, 0])

And finally we will make our network. Check out the embedding layer. The really cool thing about the embedding layer is that it is just one big weight matrix in which each row is the embedding of one index, in this case one word. So the reason the input needs to be an index is so that we know which row of the weight matrix to look up!
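
The lookup idea can be sketched in a few lines of NumPy. This is only an illustration with toy sizes and random numbers standing in for learned weights, not how Keras implements the layer internally; all the variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 6, 4   # toy sizes, not the notebook's 20000 x 100
weights = rng.normal(size=(vocab_size, embedding_dim))  # stands in for the layer's weight matrix

token_ids = np.array([1, 4, 4, 2])  # a toy "review" of four word indices

# Embedding lookup: pick one row of the matrix per token.
looked_up = weights[token_ids]            # shape (4, 4)

# Equivalent view: one-hot encode the ids and multiply by the matrix.
one_hot = np.eye(vocab_size)[token_ids]   # shape (4, 6)
via_matmul = one_hot @ weights

print(looked_up.shape)
print(np.allclose(looked_up, via_matmul))
```

Indexing and one-hot multiplication give the same result, but indexing skips building the huge, mostly-zero one-hot matrix, which is why the input is kept as plain indices.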

So we feed the embeddings into a CNN to predict whether the review is good or bad.

In [7]:
from keras.layers import Dense, Input, Flatten
from keras.layers import Conv1D, MaxPooling1D, Embedding

embedding_layer = Embedding(MAX_NB_WORDS,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH)

# build a 1D convnet; the last pooling layer spans the whole remaining sequence,
# so it acts as global max pooling
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x)
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(1, activation='sigmoid')(x)
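
If the pool size of 35 in the last layer looks arbitrary, it helps to trace the sequence length through the stack. The sketch below assumes the Keras defaults as I understand them: stride 1 and 'valid' padding for `Conv1D`, and a stride equal to the pool size for `MaxPooling1D`. The two helper functions are hypothetical.

```python
def conv1d_out(length, kernel_size):
    """Output length of a stride-1 convolution with 'valid' padding."""
    return length - kernel_size + 1

def maxpool1d_out(length, pool_size):
    """Output length of max pooling with stride equal to pool_size and 'valid' padding."""
    return length // pool_size

length = 1000  # MAX_SEQUENCE_LENGTH
trace = [length]
layers = [("conv", 5), ("pool", 5), ("conv", 5), ("pool", 5), ("conv", 5), ("pool", 35)]
for kind, size in layers:
    length = conv1d_out(length, size) if kind == "conv" else maxpool1d_out(length, size)
    trace.append(length)

print(trace)
```

Under those assumptions the third convolution leaves 35 time steps, so pooling over 35 collapses the sequence to a single step and `Flatten` hands 128 features to the dense layer.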

There is one final thing we will need to do before training the model, and it is specific to TensorFlow and thus Keras. TensorFlow allows us to visualize embeddings, but it needs a little more information about the embeddings that it will visualize: specifically, which index is which word. That is what we write out below:

In [8]:
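import csv

# Python 3: open in text mode (newline='' is what the csv module expects)
with open('word_metadata.csv', 'w', newline='', encoding='utf8') as csvfile:
    writer = csv.writer(csvfile, delimiter='\t')

    # one row per embedding row, so that line i labels index i;
    # indices with no word (such as the unused index 3) get a placeholder
    for i in range(MAX_NB_WORDS):
        writer.writerow([id_to_word.get(i, '<UNUSED>')])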


In [9]:
from keras.models import Model
from keras.callbacks import TensorBoard

model = Model(sequence_input, preds)

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

embedding_metadata = {
    embedding_layer.name: 'word_metadata.csv'
}

model.fit(x_train, y_train,
          batch_size=128,
          epochs=10,
          validation_split=VALIDATION_SPLIT,
          callbacks=[TensorBoard(log_dir='word_reps', embeddings_freq=1, embeddings_metadata=embedding_metadata)])

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x12f4241d0>

That is it. Let's check out TensorBoard! Running `tensorboard --logdir word_reps` from the notebook's directory should bring it up.
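
One thing worth trying in the embedding view is to search for a word and look at its nearest neighbours. Here is a minimal sketch of that idea using cosine similarity on hand-picked 2-d vectors; the words, vectors, and the `nearest` helper are all made up for illustration. With the trained model, the real matrix could presumably be read with `embedding_layer.get_weights()[0]`.

```python
import numpy as np

# Hypothetical 2-d embeddings, hand-picked so the example is easy to follow.
words = ["great", "brilliant", "awful", "terrible"]
vectors = np.array([[1.0, 0.1],
                    [0.9, 0.2],
                    [-1.0, 0.1],
                    [-0.8, 0.3]])

def nearest(word, k=1):
    """Return the k words whose vectors have the highest cosine similarity to `word`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[words.index(word)]
    order = [i for i in np.argsort(-sims) if words[i] != word]
    return [words[i] for i in order[:k]]

print(nearest("great"))
print(nearest("awful"))
```

Because the embeddings were learned for sentiment prediction, we would hope that words pushing a review in the same direction end up close together, as they do in this toy example.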