Many thanks to Jason Brownlee for his [blog post](https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/) on this topic, which was extremely helpful in getting this up and running, considering I have not yet had a chance to do the Keras specialization.

In [10]:
import keras
import pandas as pd
import numpy as np
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn import metrics
shows = pd.read_pickle("no_na_pre2017_v4.pkl")

In [11]:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('/Users/perrypetra-wong/Dropbox/Thinkful/Lessons/Capstone/GoogleNews-vectors-negative300.bin', binary=True)
#model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

For our text documents, let's start with a concatenation of the `tagline` and `synopsis` text columns, and define our labels: cancellation.

In [12]:
docs = shows['tagline'].astype('O') + shows['synopsis'].astype('O')
docs[0]

'A sitcom based on the Twitter feed "S*** My Dad Says", which was created by Justin Halpern and is filled with quotes said by his father.Ed is an opinionated and divorced 72-year-old man. His two sons - Henry and Vince - are both adults and over the years have become very accustomed to his unsolicited rants, which are often politically incorrect.When Henry, a struggling writer who also blogs, can\'t afford to pay his rent any longer, he\'s forced to move back in with his dad, which creates more issues in their already tricky father-son relationship.During one of Henry\'s job interviews, Ed interrupts with one of his usual irrational phone calls. This catches the ear of the interviewer, who ends up hiring Henry, but also forces him to remain living with his dad so he can keep writing about his rantings.'

In [13]:
labels = shows['two_season_cancel']

Next, we limit our word vectors to just those present in the text, so that the full Google News model does not have to stay in memory.

In [14]:
import string

embeddings_index = {}

# Translation table that strips punctuation; built once, outside the loop
punct_remover = str.maketrans('', '', string.punctuation)

for wordstring in docs:

    # Skip nulls, which pandas stores as float NaN
    if isinstance(wordstring, float):
        continue

    # Removing punctuation and splitting apart words
    punct_stripped = wordstring.translate(punct_remover)
    wordlist = punct_stripped.split()

    # If a word is in the Google model, add its corresponding vector to our
    # running subset dict
    for word in wordlist:
        word = word.lower()
        if word in model and word not in embeddings_index:
            embeddings_index[word] = model[word]

# Free the memory held by the full word2vec model
del model

In [15]:
#joblib.dump(embeddings_index,"Extras/embeddings_index.pkl")
embeddings_index = joblib.load("Extras/embeddings_index.pkl")

In [16]:
# Tokenize the texts to save mappings from words to integers
t = keras.preprocessing.text.Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) + 1

encoded_docs = t.texts_to_sequences(docs)

In [17]:
print(encoded_docs[0:1])

[[2, 253, 80, 14, 1, 2994, 2995, 346, 1356, 390, 3867, 83, 59, 659, 26, 1357, 5648, 4, 7, 1218, 8, 5649, 2046, 26, 9, 146, 2996, 7, 18, 3868, 4, 336, 3869, 127, 135, 96, 9, 48, 1130, 660, 4, 2997, 21, 155, 907, 4, 85, 1, 68, 53, 148, 491, 5650, 5, 9, 3870, 5651, 83, 21, 406, 2998, 2999, 36, 660, 2, 289, 908, 10, 79, 5652, 1544, 5653, 5, 2420, 9, 5654, 347, 1545, 492, 262, 5, 198, 70, 6, 8, 9, 390, 83, 2421, 93, 247, 6, 12, 1219, 3000, 146, 139, 212, 185, 41, 3, 5655, 128, 1045, 2996, 5656, 8, 41, 3, 9, 5657, 5658, 3001, 1546, 27, 3002, 1, 5659, 3, 1, 5660, 10, 379, 39, 5661, 660, 35, 79, 493, 62, 5, 1131, 105, 8, 9, 390, 254, 24, 92, 255, 2047, 75, 9, 5662]]
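
To make the integer encoding above concrete, here is a minimal pure-Python sketch of roughly what `Tokenizer.fit_on_texts` and `texts_to_sequences` do with their default settings: lowercase the text, strip most punctuation, then number words from 1 in order of decreasing frequency, leaving 0 free for padding. The helper names `tokenize`, `build_word_index` and `encode` and the toy documents are hypothetical, and the punctuation filter is a simplification of the Keras default.

```python
from collections import Counter
import string

# Simplified stand-in for the Keras default filter (assumption):
# replace punctuation with spaces, but keep apostrophes
FILTER = str.maketrans({c: " " for c in string.punctuation if c != "'"})

def tokenize(text):
    return text.lower().translate(FILTER).split()

def build_word_index(texts):
    counts = Counter(word for text in texts for word in tokenize(text))
    # The most frequent word gets index 1; index 0 is left for padding
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(texts, word_index):
    # Words that were never seen while fitting are simply dropped
    return [[word_index[w] for w in tokenize(t) if w in word_index] for t in texts]

toy_docs = ["A sitcom about a dad.", "A drama about a family, not a sitcom."]
toy_index = build_word_index(toy_docs)
toy_encoded = encode(toy_docs, toy_index)
print(toy_index)
print(toy_encoded)
```

This is also why `vocab_size` is `len(t.word_index) + 1`: the extra slot is the unused index 0.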


In [18]:
# Pad or truncate every sequence to one fixed length. With the pad_sequences defaults,
# documents longer than max_length keep only their last max_length tokens
max_length = 25
padded_docs = keras.preprocessing.sequence.pad_sequences(encoded_docs, maxlen=max_length, padding='post')
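
Since the first document alone encodes to more than 100 tokens, `max_length = 25` truncates most synopses rather than padding them. With `padding='post'` and the default `truncating='pre'`, `pad_sequences` keeps the end of a long sequence and appends zeros to a short one. The sketch below mimics that behavior for a single sequence; `pad_post` is a hypothetical helper, not part of Keras.

```python
def pad_post(seq, maxlen, value=0):
    """Mimic pad_sequences(padding='post', truncating='pre') for one sequence."""
    kept = seq[-maxlen:]                            # 'pre' truncation drops tokens from the front
    return kept + [value] * (maxlen - len(kept))    # 'post' padding appends the fill value

short = pad_post([5, 8, 2], maxlen=5)
long = pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [5, 8, 2, 0, 0]
print(long)   # [3, 4, 5, 6, 7]
```

If the opening of each synopsis matters more than its ending, passing `truncating='post'` to `pad_sequences` would keep the first `maxlen` tokens instead.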

Next we need to create a matrix of one embedding per word in the training set. We can do that by enumerating all unique words in the Tokenizer.word_index and locating the corresponding vector in the word2vec subset we've created above.

In [19]:
# Weight matrix for each word in training doc
embedding_matrix = np.zeros((vocab_size, 300))
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
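
Words without a pretrained vector keep an all-zero row in this matrix, so it can be worth checking how much of the vocabulary is actually covered. The sketch below shows one way to do that on toy stand-ins; `toy_word_index`, `toy_embeddings` and the 4-dimensional vectors are made up for illustration and are not the notebook's data.

```python
import numpy as np

dim = 4  # toy dimension; the notebook uses 300
toy_word_index = {"a": 1, "sitcom": 2, "dad": 3, "halpern": 4}
toy_embeddings = {"sitcom": np.ones(dim), "dad": np.full(dim, 0.5)}

# Same construction as above: one row per word index, plus row 0 for padding
toy_matrix = np.zeros((len(toy_word_index) + 1, dim))
for word, i in toy_word_index.items():
    vec = toy_embeddings.get(word)
    if vec is not None:
        toy_matrix[i] = vec

# Skip row 0 (padding) and count the rows that received a vector
covered = np.any(toy_matrix[1:] != 0, axis=1)
coverage = covered.mean()
print(f"{coverage:.0%} of vocabulary words have a pretrained vector")
```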

In [20]:
e = keras.layers.Embedding(vocab_size,300,weights=[embedding_matrix],input_length=max_length,trainable=False)

# Define the model
model = keras.Sequential()
model.add(e)
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(1,activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 300)           3176100   
_________________________________________________________________
flatten_1 (Flatten)          (None, 7500)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 7501      
Total params: 3,183,601
Trainable params: 7,501
Non-trainable params: 3,176,100
_________________________________________________________________
None
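
The parameter counts in the summary can be reproduced by hand, which is a quick sanity check that the frozen embedding dominates the model and only the final dense layer is trained. The arithmetic below uses only numbers shown in the summary.

```python
embedding_dim = 300
max_length = 25
embedding_params = 3_176_100                     # from the summary above

vocab_size = embedding_params // embedding_dim   # rows in the embedding matrix
flattened = max_length * embedding_dim           # width of the Flatten output
dense_params = flattened * 1 + 1                 # one weight per input plus one bias
total_params = embedding_params + dense_params

print(vocab_size, flattened, dense_params, total_params)
```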


In [21]:
model.fit(padded_docs,labels,epochs=50,verbose=1)

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x235813f60>

In [22]:
y_pred = model.predict(padded_docs)

In [23]:
metrics.accuracy_score(labels,np.round(y_pred))

0.9992401215805471

Unsurprisingly, it performs very well on the training set, but this was only a check that the model works. Let's now cross-validate it in the same way that we've done for our other models.
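
One thing to keep in mind: `KFold` without `shuffle=True` cuts the data into contiguous blocks in row order. The sketch below reproduces that splitting with plain NumPy on a toy index of 10 rows; `contiguous_folds` is a hypothetical helper written for illustration, not scikit-learn's implementation.

```python
import numpy as np

def contiguous_folds(n_samples, n_splits):
    """Yield (train, test) index arrays the way an unshuffled K-fold split would."""
    indices = np.arange(n_samples)
    # array_split gives the first (n_samples % n_splits) folds one extra element
    for test in np.array_split(indices, n_splits):
        train = np.setdiff1d(indices, test)
        yield train, test

folds = list(contiguous_folds(10, 3))
for train, test in folds:
    print("test:", test, "train:", train)
```

If the rows of `shows` happen to be ordered (for example by premiere date or network), each fold is then tested on a different slice of that ordering, and `KFold(n_splits=3, shuffle=True, random_state=0)` would be one way to avoid that.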

In [24]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=3)

# reindexing docs and labels to avoid any errors there
docs.index = np.arange(0,len(docs))
docs = docs.reindex()
labels.index = (np.arange(0,len(labels)))

for train,test in kf.split(docs):
      
    # Train and test
    docs_train = docs[train]
    labels_train = labels[train]
    docs_test = docs[test]
    labels_test = labels[test]
    
    # Tokenize the texts to save mappings from words to integers
    t = keras.preprocessing.text.Tokenizer()
    t.fit_on_texts(docs_train)
    vocab_size = len(t.word_index) + 1

    encoded_docs = t.texts_to_sequences(docs_train)
    
    # Capping the lengths
    max_length = 25
    padded_docs = keras.preprocessing.sequence.pad_sequences(encoded_docs, maxlen=max_length, padding='post')
    
    # Weight matrix for each word in training doc
    embedding_matrix = np.zeros((vocab_size, 300))
    for word, i in t.word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    
    e = keras.layers.Embedding(vocab_size,300,weights=[embedding_matrix],input_length=max_length,trainable=False)
    
    # Define the model
    model = keras.Sequential()
    model.add(e)
    model.add(keras.layers.Flatten())
    model.add(keras.layers.Dense(1,activation='sigmoid'))

    # Compile the model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
    
    model.fit(padded_docs,labels_train,epochs=50,verbose=0)
    
    # Test set
    encoded_test = t.texts_to_sequences(docs_test)
    padded_docs_test = keras.preprocessing.sequence.pad_sequences(encoded_test, maxlen=max_length, padding='post')
    
    y_pred = model.predict(padded_docs_test)
    loss,accuracy = model.evaluate(padded_docs_test,labels_test)
    
    print('Accuracy on held-out fold: {0:.2%}'.format(accuracy))

Accuracy on held-out fold: 56.04%
Accuracy on held-out fold: 53.08%
Accuracy on held-out fold: 54.11%


In [25]:
# Confirming that the accuracy Keras reports for the last fold matches rounding the predicted probabilities at 0.5
metrics.accuracy_score(np.round(y_pred),labels_test)

0.541095890410959

Sadly, we're still looking at pretty poor results. Even though our neural net easily reaches nearly 100% training accuracy within 50 epochs, accuracy falls apart when applied to holdouts. As such, I don't think adding layers to our net and optimizing would be very fruitful, since more complex learning from the training documents only seems to hurt our predictive value. In other words, the model is overfitting: it sits at the high-variance end of the bias-variance tradeoff, memorizing the training documents rather than learning anything that generalizes.
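
The size of that gap can be read straight off the numbers reported above. The short calculation below averages the three held-out accuracies and compares them with the training accuracy; note that the training figure came from a model fit on all documents, so the comparison is only indicative.

```python
train_acc = 0.9992401215805471               # training-set accuracy reported above
cancel_fold_accs = [0.5604, 0.5308, 0.5411]  # held-out accuracies printed by the loop

cancel_cv_mean = sum(cancel_fold_accs) / len(cancel_fold_accs)
gap = train_acc - cancel_cv_mean
print(f"mean CV accuracy: {cancel_cv_mean:.2%}, train/CV gap: {gap:.1%}")
```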

This is not to say that NLP doesn't have some cool applications in this dataset. Out of curiosity, let's see how this network does when assigned to something else, like telling whether or not a show is a reality show!

In [26]:
labels = shows['Reality']

In [30]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=3)

# reindexing docs and labels to avoid any errors there
docs.index = np.arange(0,len(docs))
docs = docs.reindex()
labels.index = (np.arange(0,len(labels)))

for train,test in kf.split(docs):
      
    # Train and test
    docs_train = docs[train]
    labels_train = labels[train]
    docs_test = docs[test]
    labels_test = labels[test]
    
    # Tokenize the texts to save mappings from words to integers
    t = keras.preprocessing.text.Tokenizer()
    t.fit_on_texts(docs_train)
    vocab_size = len(t.word_index) + 1

    encoded_docs = t.texts_to_sequences(docs_train)
    
    # Capping the lengths
    max_length = 35
    padded_docs = keras.preprocessing.sequence.pad_sequences(encoded_docs, maxlen=max_length, padding='post')
    
    # Weight matrix for each word in training doc
    embedding_matrix = np.zeros((vocab_size, 300))
    for word, i in t.word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    
    e = keras.layers.Embedding(vocab_size,300,weights=[embedding_matrix],input_length=max_length,trainable=False)
    
    # Define the model
    model = keras.Sequential()
    model.add(e)
    model.add(keras.layers.Flatten())
    model.add(keras.layers.Dense(30,activation='relu'))
    model.add(keras.layers.Dense(1,activation='sigmoid'))


    # Compile the model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
    
    model.fit(padded_docs,labels_train,epochs=50,verbose=0)
    
    # Test set
    encoded_test = t.texts_to_sequences(docs_test)
    padded_docs_test = keras.preprocessing.sequence.pad_sequences(encoded_test, maxlen=max_length, padding='post')
    
    y_pred = model.predict_classes(padded_docs_test)
    loss,accuracy = model.evaluate(padded_docs_test,labels_test)
    
    print('Accuracy on held-out fold: {0:.2%}'.format(accuracy))

Accuracy on held-out fold: 78.82%
Accuracy on held-out fold: 81.32%
Accuracy on held-out fold: 80.14%


In [19]:
shows['primary_genre'].value_counts().apply(lambda x:x/len(shows))

Drama        0.402736
Comedy       0.308511
Reality      0.238602
Game Show    0.025836
Talk         0.022796
Sci-fi       0.001520
Name: primary_genre, dtype: float64

In [20]:
shows['Reality'].mean()

0.23860182370820668

Pretty impressive for a quick solution, with about 80% accuracy, even if that is only about 4 percentage points above the 76% baseline of always guessing "not reality". Let's see how it does with all genres.
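
The comparison against the baseline can be made explicit with the figures printed above: always predicting the majority class scores `1 - shows['Reality'].mean()`, and the lift is the mean fold accuracy minus that. This is plain arithmetic on the reported numbers, not a rerun of the model.

```python
reality_rate = 0.23860182370820668            # shows['Reality'].mean() from above
reality_fold_accs = [0.7882, 0.8132, 0.8014]  # held-out accuracies printed by the loop

baseline = max(reality_rate, 1 - reality_rate)   # always predict the majority class
reality_cv_mean = sum(reality_fold_accs) / len(reality_fold_accs)
lift = reality_cv_mean - baseline
print(f"baseline: {baseline:.2%}, mean CV accuracy: {reality_cv_mean:.2%}, lift: {lift:.2%}")
```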

In [32]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

# Need to properly encode the outputs: genre strings -> integer codes -> one-hot rows
encoded_labels = encoder.fit_transform(shows['primary_genre'])
labels = pd.Series(encoded_labels)
one_hot_labels = keras.utils.to_categorical(encoded_labels)
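
For intuition, the two encoding steps can be sketched with NumPy alone: `np.unique` assigns integer codes in sorted class order (as `LabelEncoder` does), and indexing an identity matrix turns those codes into one-hot rows. The `genres` array is a toy sample, not the real column.

```python
import numpy as np

genres = np.array(["Drama", "Comedy", "Reality", "Drama", "Talk"])  # toy sample

# Sorted unique classes, plus the integer code of each entry
classes, codes = np.unique(genres, return_inverse=True)
# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(len(classes))[codes]

print(classes)
print(codes)
print(one_hot)
```

Each row sums to 1, and its column order follows `classes`, which is what the 6-unit softmax output layer is trained against.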

In [34]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=3)

# reindexing docs and labels to avoid any errors there
docs.index = np.arange(0,len(docs))
docs = docs.reindex()

for train,test in kf.split(docs):
      
    # Train and test
    docs_train = docs[train]
    labels_train = one_hot_labels[train]
    docs_test = docs[test]
    labels_test = one_hot_labels[test]
    
    # Tokenize the texts to save mappings from words to integers
    t = keras.preprocessing.text.Tokenizer()
    t.fit_on_texts(docs_train)
    vocab_size = len(t.word_index) + 1

    encoded_docs = t.texts_to_sequences(docs_train)
    
    # Capping the lengths
    max_length = 40
    padded_docs = keras.preprocessing.sequence.pad_sequences(encoded_docs, maxlen=max_length, padding='post')
    
    # Weight matrix for each word in training doc
    embedding_matrix = np.zeros((vocab_size, 300))
    for word, i in t.word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    
    e = keras.layers.Embedding(vocab_size,300,weights=[embedding_matrix],input_length=max_length,trainable=False)
    
    # Define the model
    model = keras.Sequential()
    model.add(e)
    model.add(keras.layers.Flatten())
    model.add(keras.layers.Dense(30))
    model.add(keras.layers.Dense(30))
    model.add(keras.layers.Dense(6,activation='softmax'))

    # Compile the model
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
    
    model.fit(padded_docs,labels_train,epochs=50,verbose=0)
    
    # Test set
    encoded_test = t.texts_to_sequences(docs_test)
    padded_docs_test = keras.preprocessing.sequence.pad_sequences(encoded_test, maxlen=max_length, padding='post')
    
    y_pred = model.predict_classes(padded_docs_test)
    loss,accuracy = model.evaluate(padded_docs_test,labels_test)
    
    print('Accuracy on held-out fold: {0:.2%}'.format(accuracy))

Accuracy on held-out fold: 63.55%
Accuracy on held-out fold: 64.69%
Accuracy on held-out fold: 63.24%


Very cool. That is almost two-thirds accuracy (about 64%), versus roughly 40% for a baseline that always guesses the dominant class, Drama. Again, this doesn't help our cause of predicting cancellation, which seems like it's just not solvable with NLP given the current information we have.

In [23]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_10 (Embedding)     (None, 40, 300)           2551800   
_________________________________________________________________
flatten_10 (Flatten)         (None, 12000)             0         
_________________________________________________________________
dense_12 (Dense)             (None, 20)                240020    
_________________________________________________________________
dense_13 (Dense)             (None, 6)                 126       
Total params: 2,791,946
Trainable params: 240,146
Non-trainable params: 2,551,800
_________________________________________________________________
