## Predicting with Neural Networks

In NLP, word2vec refers to a family of models that learn a numerical vector representation of each word's meaning. The individual components of a word's vector are somewhat arbitrary, but words with similar meanings are expected to end up with vectors that lie close to one another.
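As a rough illustration of what "close to each other" can mean, the sketch below compares vectors with cosine similarity, one common closeness measure. The three-dimensional vectors are invented for this example and are not taken from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, made up purely for illustration.
toy_vectors = {
    "fire":  [0.9, 0.1, 0.3],
    "blaze": [0.8, 0.2, 0.4],
    "lunch": [-0.5, 0.9, 0.0],
}

sim_related = cosine_similarity(toy_vectors["fire"], toy_vectors["blaze"])
sim_unrelated = cosine_similarity(toy_vectors["fire"], toy_vectors["lunch"])

# The related pair scores much higher than the unrelated pair.
print(round(sim_related, 3), round(sim_unrelated, 3))
```

With real embeddings the same function could be applied to any two rows of the embedding matrix.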

Such vectors are a good starting point either for inferring the meaning of unknown words or for deriving new value from the learned meanings. Several methods use neural networks trained with backpropagation to derive word vector representations. In our neural network approach, we will first use the skip-gram design to derive a set of word vectors over our tweet space, and then feed those vectors into a model that determines the overall sentiment (disaster/safe) of each tweet.

### Word Vectors

We'll implement the skip-gram algorithm over our corpus to derive our word vectors. In the skip-gram method as set up here, pairs of words from the corpus are labeled 1 if the words appear close to each other and 0 if they are a random pairing. Embedding layers map each vocabulary-sized input to a dense vector, and the network is trained to predict those binary labels. The learned embedding weights then become the word2vec vectors, which can be applied to another model.
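To make the pairing concrete, here is a minimal pure-Python sketch of how positive and negative pairs might be generated for one tokenized sentence. It is written in the same spirit as the Keras `skipgrams` helper but is not a copy of it: the function name `make_skipgram_pairs`, the window size, and the one-negative-per-positive ratio are all assumptions made for illustration.

```python
import random

def make_skipgram_pairs(sequence, vocab_size, window=2, seed=0):
    """Return (pairs, labels) for one sentence given as a list of word ids.

    Positive pairs (label 1) couple words at most `window` positions apart.
    For every positive pair, one negative pair (label 0) couples the same
    target word with a randomly drawn word id. Ids are assumed to start at 1,
    with 0 reserved for padding.
    """
    rng = random.Random(seed)
    pairs, labels = [], []
    for i, target in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            # Observed neighbour: positive example.
            pairs.append((target, sequence[j]))
            labels.append(1)
            # Random word: negative example.
            pairs.append((target, rng.randint(1, vocab_size - 1)))
            labels.append(0)
    return pairs, labels

toy_ids = [3, 1, 4, 2]  # hypothetical word ids for a four-word tweet
pairs, labels = make_skipgram_pairs(toy_ids, vocab_size=6, window=1)
print(len(pairs), sum(labels))
```

Note that a randomly drawn "negative" word can occasionally be a genuine neighbour; with a large vocabulary this is rare enough that it is usually tolerated.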

It's important to train the word vectors over the whole corpus. Our classifier will be trained on only part of the corpus, and we can't guarantee that every word in the testing set also appears in the training set. Training the vectors on the whole corpus means the classifier can still get some predictive capability from words it never saw during its own training.
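The coverage problem can be seen on a toy example. The four short texts below are invented for illustration; the sketch simply checks which test-set words would have no vector if the vocabulary were built from the training texts alone.

```python
# Hypothetical miniature corpus, split into train and test portions.
train_texts = ["forest fire near town", "love this sunny day"]
test_texts = ["wildfire near forest", "sunny evacuation"]

def vocabulary(texts):
    """Set of whitespace-separated tokens appearing in the texts."""
    return {word for text in texts for word in text.split()}

train_vocab = vocabulary(train_texts)
full_vocab = vocabulary(train_texts + test_texts)
test_words = vocabulary(test_texts)

# Words that appear in the test texts but have no entry in each vocabulary.
unseen_in_train = sorted(test_words - train_vocab)
unseen_in_full = sorted(test_words - full_vocab)

print(unseen_in_train)  # ['evacuation', 'wildfire']
print(unseen_in_full)   # []
```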

In [1]:
# First load our corpus of previously processed tweets
import pandas as pd

tweet_df = pd.read_csv('../data/processed_kaggle_training.csv')
tweet_df['processed_text'] = tweet_df['processed_text'].astype(str)
texts = tweet_df.processed_text.values

In [2]:
# prepare for skipgram word2vec with keras helpers
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import skipgrams

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

word2id = tokenizer.word_index
id2word = {v:k for k, v in word2id.items()}

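# convert each tweet to a list of word ids, then build skip-gram pairs;
# vocabulary_size must be the maximum word index + 1 (word ids start at 1)
wids = [[word2id[w] for w in text_to_word_sequence(text)] for text in texts]
skip_grams = [skipgrams(wid, vocabulary_size=len(word2id) + 1) for wid in wids]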


In [3]:
# construct our keras model for skipgrams
from keras.layers import Dot, Input
from keras.layers.core import Dense, Reshape
from keras.layers.embeddings import Embedding
from keras.models import Model

VOCAB_SIZE = len(word2id) + 1
EMBED_SIZE = 100

word_input = Input(name='word_input',shape=[1])
layer = Embedding(
    VOCAB_SIZE, EMBED_SIZE, embeddings_initializer="glorot_uniform",
    input_length=1)(word_input)
word_layer = Reshape((EMBED_SIZE,))(layer)

context_input = Input(name='context_input',shape=[1])
layer = Embedding(
    VOCAB_SIZE, EMBED_SIZE, embeddings_initializer="glorot_uniform",
    input_length=1)(context_input)
context_layer = Reshape((EMBED_SIZE,))(layer)

merge_layer = Dot(axes=1)([word_layer, context_layer])
output = Dense(1, kernel_initializer="glorot_uniform", activation="sigmoid")(merge_layer)
sg_model = Model(inputs = [word_input, context_input], outputs = output)
    
sg_model.compile(loss="mean_squared_error", optimizer="adam")


In [4]:
# and now train our prepared skipgrams on our model
import numpy as np
for epoch in range(1, 6):
    loss = 0
    for i, elem in enumerate(skip_grams):
        if len(elem[0]) == 0:
            continue
        pair_first_elem = np.array(list(zip(*elem[0]))[0], dtype='int32')
        pair_second_elem = np.array(list(zip(*elem[0]))[1], dtype='int32')
        labels = np.array(elem[1], dtype='int32')
        X = [pair_first_elem, pair_second_elem]
        Y = labels
        loss += sg_model.train_on_batch(X,Y)

    print('Epoch:', epoch, 'Loss:', loss)

Epoch: 1 Loss: 1546.904946116265
Epoch: 2 Loss: 1097.2918282193132
Epoch: 3 Loss: 723.0027144331598
Epoch: 4 Loss: 435.4063223999115
Epoch: 5 Loss: 268.8751482185553


In [5]:
# still seeing a pretty big drop in the loss at the end there, so let's run
# a few more epochs before saving the embedding
for epoch in range(6, 10):
    loss = 0
    for i, elem in enumerate(skip_grams):
        if len(elem[0]) == 0:
            continue
        pair_first_elem = np.array(list(zip(*elem[0]))[0], dtype='int32')
        pair_second_elem = np.array(list(zip(*elem[0]))[1], dtype='int32')
        labels = np.array(elem[1], dtype='int32')
        X = [pair_first_elem, pair_second_elem]
        Y = labels
        loss += sg_model.train_on_batch(X,Y)

    print('Epoch:', epoch, 'Loss:', loss)

Epoch: 6 Loss: 183.66639455841906
Epoch: 7 Loss: 124.08219412733192
Epoch: 8 Loss: 93.6779270734111
Epoch: 9 Loss: 69.78028212264012


In [49]:
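# We're in the realm of overfitting now, so let's save these embeddings.
# Row 0 is the padding index; rows 1..N follow the tokenizer's word ids.
tok_words = ['PAD'] + [id2word[i] for i in range(1, VOCAB_SIZE)]

# layers[0] and layers[1] are the inputs; layers[2] should be the
# target-word Embedding layer, whose weight matrix holds the word vectors
weights = sg_model.layers[2].get_weights()[0]
wdf = pd.DataFrame(weights, index=tok_words)
wdf.to_csv('word_vectors.csv')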

### Classification

Now that we have trained word vectors over our entire corpus, we can plug them into our sequential classification model. This is an interesting departure from traditional machine learning methods, which typically operate solely on a bag-of-words input. Our model should instead be able to use word order and context for extra predictive capability, though we'll have to compare it against our traditional methods afterwards to determine which performs better on this input.

The particular mechanism that keeps sequential context relevant is hidden from us inside the Keras Long Short-Term Memory (LSTM) layer. For our purposes it is enough to know that an LSTM layer both keeps some memory of past values and helps mitigate the vanishing/exploding gradient problem that arises when training simpler recurrent networks on sequential input. We'll also use a random dropout layer provided by Keras to reduce overfitting.
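For readers who want a peek behind the layer, the sketch below runs the standard LSTM update equations for a single unit with scalar input. It is an illustration only: the parameter names and values are hypothetical, and the actual Keras layer works on vectors and matrices and may differ in implementation details.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single-unit LSTM (standard formulation)."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g    # cell state: keep part of the old, add some new
    h = o * math.tanh(c)      # hidden state exposed to the next layer
    return h, c

# Hypothetical parameters: every weight and bias set to 0.5.
names = ("wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wg", "ug", "bg")
params = {name: 0.5 for name in names}

# Feed a short made-up sequence through the cell, carrying state forward.
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)

print(h, c)
```

The additive form of the cell-state update (`f * c_prev + i * g`) is the part usually credited with letting information, and gradients, survive across many time steps.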

In [108]:
from keras.layers import Activation, LSTM, Dropout

# we'll be padding all tweets to the max tweet's token length
# for consistent sequence length
INPUT_LEN = max([len(wid_v) for wid_v in wids])

# define the model
inputs = Input(name="lstm_input", shape=[INPUT_LEN])
layer = Embedding(
    VOCAB_SIZE, EMBED_SIZE, weights=[weights], input_length=INPUT_LEN,
    trainable=False
    )(inputs)
layer = LSTM(100)(layer)
layer = Dense(200, activation='relu')(layer)
layer = Dropout(0.2)(layer)
layer = Dense(1, name="lstm_output")(layer)
layer = Activation("sigmoid")(layer)
lstm_model = Model(inputs=inputs, outputs=layer)
lstm_model.summary()

Model: "model_12"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_input (InputLayer)      (None, 26)                0         
_________________________________________________________________
embedding_13 (Embedding)     (None, 26, 100)           1181200   
_________________________________________________________________
lstm_11 (LSTM)               (None, 100)               80400     
_________________________________________________________________
dense_12 (Dense)             (None, 200)               20200     
_________________________________________________________________
dropout_11 (Dropout)         (None, 200)               0         
_________________________________________________________________
lstm_output (Dense)          (None, 1)                 201       
_________________________________________________________________
activation_13 (Activation)   (None, 1)                 0  

In [114]:
from keras.optimizers import RMSprop

# compile the model
lstm_model.compile(loss='binary_crossentropy', optimizer=RMSprop(), metrics=['accuracy'])
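As a small aside on the loss chosen above, binary cross-entropy rewards confident correct probabilities and heavily penalizes confident wrong ones. The sketch below computes it by hand on invented labels and probabilities; the clipping constant `eps` is an assumption made here to avoid `log(0)`, not a value taken from Keras.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over paired labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

toy_labels = [1, 0, 1, 0]  # hypothetical disaster/safe targets
confident_right = binary_crossentropy(toy_labels, [0.9, 0.1, 0.8, 0.2])
confident_wrong = binary_crossentropy(toy_labels, [0.1, 0.9, 0.2, 0.8])

# The loss is far smaller when the probabilities agree with the labels.
print(round(confident_right, 3), round(confident_wrong, 3))
```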

Next we build the padded input sequences to feed to this model, along with the target vector.

In [110]:
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

X = pad_sequences(wids, maxlen = INPUT_LEN, padding='post', value=0)
y = tweet_df['target']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)
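The padding step can be pictured with a small pure-Python stand-in. This is only a sketch of the post-padding behaviour used above, not the Keras implementation; the helper name `pad_post` is made up, and it assumes `maxlen` is at least as long as the longest sequence, so no truncation is needed.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad every sequence with `value` until it has length `maxlen`."""
    padded = []
    for seq in sequences:
        if len(seq) > maxlen:
            raise ValueError("sequence longer than maxlen")
        padded.append(list(seq) + [value] * (maxlen - len(seq)))
    return padded

# Hypothetical word-id sequences for three tweets of different lengths.
toy_wids = [[5, 2, 9], [7], [3, 8]]
toy_len = max(len(seq) for seq in toy_wids)
toy_X = pad_post(toy_wids, toy_len)

print(toy_X)  # [[5, 2, 9], [7, 0, 0], [3, 8, 0]]
```

Because id 0 is reserved for padding, padded positions look up the `PAD` row of the embedding matrix rather than a real word.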

And finally train the model:

In [111]:
# train for a fixed 50 epochs, holding out 20% of the training data for validation
lstm_model.fit(X_train,
               y_train,
               batch_size=256,
               epochs=50,
               validation_split=0.2)

Train on 4872 samples, validate on 1218 samples
Epoch 1/50
...
Epoch 50/50



In [112]:
# to see how we did, threshold the sigmoid output at 0.5 to get class labels
preds = (lstm_model.predict(X_test).ravel() >= 0.5).astype(int)

In [113]:
from sklearn.metrics import accuracy_score

# and measure binary accuracy
accuracy_score(y_test, preds)

0.7806959947472094
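The two evaluation steps above, thresholding and accuracy, amount to very little arithmetic. The sketch below reproduces them by hand on invented probabilities and labels, so the numbers bear no relation to the model's real output.

```python
def threshold(probs, cutoff=0.5):
    """Map each probability to class 1 if it reaches the cutoff, else 0."""
    return [1 if p >= cutoff else 0 for p in probs]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

toy_probs = [0.91, 0.42, 0.50, 0.08, 0.77]  # hypothetical sigmoid outputs
toy_truth = [1, 1, 1, 0, 0]                 # hypothetical true labels

toy_preds = threshold(toy_probs)
toy_acc = accuracy(toy_truth, toy_preds)

print(toy_preds)  # [1, 0, 1, 0, 1]
print(toy_acc)    # 0.6
```

Accuracy treats both kinds of error alike; if missing a real disaster were costlier than a false alarm, a different cutoff or metric might be worth considering.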

While our model scores well above random chance on the test set, it falls short of the certainty we might have liked. One possible cause is that tweets are simply too short to give a sequence model much to work with. We'll have to compare against our traditional methods using TF-IDF.