In [1]:
%autosave 0

Autosave disabled


## Assignment 2 - Movie Classification, the sequel
![](https://images-na.ssl-images-amazon.com/images/S/sgp-catalog-images/region_US/paramount-01376-Full-Image_GalleryBackground-en-US-1484000188762._RI_SX940_.jpg)


#### In this assignment, we will learn a little more about word2vec and then use the resulting vectors to make some predictions.

We will be working with a movie synopsis dataset, found here: http://www.cs.cmu.edu/~ark/personas/

The overall goal should sound a little familiar - based on the movie synopses, we will classify movie genre. Some of your favorites should be in this dataset, and hopefully, based on the genre specific terminology of the movie synopses, we will be able to figure out which movies are which type.

### Task 1: clean your dataset!

For your input data:

1. Find the top 10 movie genres
2. Remove any synopses that don't fit into these genres
3. Take the top 10,000 reviews in terms of "Movie box office revenue"

Congrats, you've got a dataset! For each movie, some of them may have multiple classifications. To deal with this, you'll have to look at the Reuters dataset classification code that we used previously and possibly this example: https://github.com/keras-team/keras/blob/master/examples/reuters_mlp.py

We want to use categorical cross-entropy as our loss function (or a one vs. all classifier in the case of SVM) because our data will potentially have multiple classes!

In [2]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.pipeline import Pipeline
from sklearn import metrics
import gensim
import word2vec

import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers import LSTM
from keras.regularizers import l2

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


In [3]:
dataPath = '../data/MovieSummaries/'
movieHeader = ['wikiID', 'freebaseID', 'name', 'releaseDate', 'revenue',
               'runtime', 'languages', 'countries', 'genres']
movieDat = pd.read_csv(dataPath + 'movie.metadata.tsv', delimiter = '\t',
                      header = None, names = movieHeader)
synopsisDat = pd.read_csv(dataPath + 'plot_summaries.txt', delimiter = '\t',
                      header = None, names = ['wikiID', 'synopsis'])

In [4]:
# To find top genres, will split the genres into their own columns
# since one move can be multiple genres and then sum those columns
# to find the max 10

### Step 1 -- Convert Genres into a list of genres
def cleanGenres(genreDat):
    clean = [re.findall(r'"\S+": "(.+)"', x) for x in genreDat.split(',')]
    return [item for sublist in clean for item in sublist]
    
movieDat['genres_clean'] = movieDat.genres.apply(cleanGenres)

### Step 2 --- "One Hot Encode" that list and then join back
mlb = MultiLabelBinarizer()
movieDat = movieDat.join(pd.DataFrame(mlb.fit_transform(movieDat['genres_clean']),
                          columns=mlb.classes_,
                          index=movieDat.index))

### Step 3 --- Find genres with the largest sums
idCols = movieHeader.copy()
idCols.extend(['genres_clean'])
genreCols = movieDat.columns.difference(idCols)
topGenres = movieDat.loc[:, genreCols].sum().nlargest(10).index

In [5]:
# Now filter dataset to those that contain these genres and take top 10,000 highest grossing
movieDat['topGenresOnly'] = movieDat['genres_clean'].apply(lambda x: set(x).intersection(topGenres))
containsGenreBool = movieDat['genres_clean'].apply(any)
movies = movieDat.loc[(containsGenreBool) & movieDat['revenue'].notnull()]
movies = movies.sort_values('revenue', ascending=False)
# finalDat = movies.merge(synopsisDat, how = 'inner', on='wikiID').iloc[:10000]
finalDat = movies.merge(synopsisDat, how = 'inner', on='wikiID')


# Split into X and y and preprocess
X = np.array(finalDat['synopsis'].apply(gensim.utils.simple_preprocess))
y = np.array(finalDat['topGenresOnly'])

all_words = set(w for words in X for w in words)
vocab_size = len(all_words)
embed_size = 200

In [6]:
# For use in keras NN
MAX_SEQUENCE_LENGTH = 1000
MAX_NUM_WORDS = 30000
batch_size = 32

tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(finalDat['synopsis'])
seq = tokenizer.texts_to_sequences(finalDat['synopsis'])
word_index = tokenizer.word_index

# Cut down to the top X words
index_word = {v: k for k, v in word_index.items() if v < (MAX_NUM_WORDS + 1)}

### Task 2: Split the data

Make a dataset of 70% train and 30% test. Sweet.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state=10)

In [8]:
mlb2 = MultiLabelBinarizer()
train_labels = mlb2.fit_transform(y_train)
test_labels = mlb2.transform(y_test)

# x_train = pad_sequences(X_train, maxlen=maxlen, padding='post')
# x_test = pad_sequences(X_test, maxlen=maxlen, padding='post')

### Task 3a: Build a model using ONLY word2vec

Woah what? I don't think that's recommended...

In fact it's a commonly accepted practice. What you will want to do is average the word vectors that will be input for a given synopsis (https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html) and then input that averaged vector as your feature space into a model. For this example, use a Support Vector Machine classifier. For your first time doing this, train a model in Gensim and use the output vectors.


In [9]:
# Need to convert X_train and X_test to matrix of values by
# substituting each word to get its mean embedded score

def convert_word_mat_to_mean_embed(word_mat, w2v):
    dim = len(next(iter(w2v.values())))
    return np.array([np.mean([w2v[w] for w in words if w in w2v]
                             or [np.zeros(dim)], axis=0)
                     for words in word_mat])

In [10]:
def runMod(mod, gensim_model):
    w2embed = {w: vec for w, vec in zip(gensim_model.wv.index2word, 
                                    gensim_model.wv.syn0)}
    clf = OneVsRestClassifier(mod(random_state=10))
    train_x = convert_word_mat_to_mean_embed(X_train, w2embed)
    clf.fit(train_x, train_labels)
    
    test_x = convert_word_mat_to_mean_embed(X_test, w2embed)
    preds = clf.predict(test_x)
    acc = metrics.accuracy_score(test_labels, preds)
    return (clf, acc)

In [11]:
model_user = gensim.models.Word2Vec(X_train, size=embed_size, 
                                window=5, min_count=5, workers=-1)
model_user.train(X_train,total_examples=len(X_train),epochs=10)

0

In [12]:
svc_user = runMod(LinearSVC, model_user)

### Task 3b: Do the same thing but with pretrained embeddings

Now pull down the Google News word embeddings and do the same thing. Compare the results. Why was one better than the other?

In [13]:
model_w2v = gensim.models.KeyedVectors.load_word2vec_format('../data/GoogleNews-vectors-negative300.bin.gz', binary=True)

In [14]:
svc_w2v = runMod(LinearSVC, model_w2v)

In [15]:
print('Accuracy of SVC User: {}'.format(svc_user[1]))
print('Accuracy of SVC W2V: {}'.format(svc_w2v[1]))

Accuracy of SVC User: 0.09538461538461539
Accuracy of SVC W2V: 0.21142857142857144


The Google Word2Vec model is about double as good as the trained model, most likely due to the fact that it was trained on a bigger corpus. That bigger corpus is more likely to include sequences of data that are found in the test data that may not be in the training data.  

That being said, neither model does well in classying genres...

### Task 4: Build a neural net model using word2vec embeddings (both pretrained and within an Embedding layer from Keras)

In [16]:
def create_int_word_dict(model):
    mapdict = {}
    for i in range(len(model.wv.vocab)):
        word = model.wv.index2word[i]
        mapdict[word] = i
    return mapdict

In [17]:
def create_embed_matrix(model, all_words=all_words, embed_size=embed_size):
    "Create a weight matrix for words"
    vocab_size = len(all_words)
    embedding_matrix = np.zeros((vocab_size, embed_size))
    n = 0
    word_list = list(all_words)
    for i in range(vocab_size):
        word = word_list[i]
        if word in model.wv.vocab:
            embedding_vector = model.wv[word]
            if embedding_vector is not None:
                embedding_matrix[n] = embedding_vector[:embed_size]
                n += 1

    return embedding_matrix[:n, :]


In [18]:
# Convert X datasets from words to numbers
def convert_word_to_num(word_mat, max_len = MAX_SEQUENCE_LENGTH):
    new_mat = []
    for review in word_mat:
        tmp = []
        for w in review:
            if w in word_index:
                tmp.append(word_index[w])

            else:
                tmp.append(0)
        new_mat.append(tmp)

    return pad_sequences(new_mat, padding='post', maxlen=max_len)    

In [19]:
x_train = convert_word_to_num(X_train)
x_test = convert_word_to_num(X_test, max_len=x_train.shape[1])

In [20]:
def run_keras_model1(gensim_model, batch_size=batch_size, 
                     create_emebed_mat = False):
    
    # For embedding layer
    if create_emebed_mat:
        embed_matrix = create_embed_matrix(gensim_model)
        e = Embedding(embed_matrix.shape[0], embed_matrix.shape[1],
                      weights=[embed_matrix],
                     input_length=x_train.shape[1], trainable=False)
    else:
        e = gensim_model.wv.get_keras_embedding()
        e.input_length = x_train.shape[1]    
    
    # define model
    print('Build model...')
    keras_mod = Sequential()
    # e = Embedding(vocab_size, embed_size, weights=[embed_matrix], 
    #               input_length=maxlen, trainable=False)
    keras_mod.add(e)
    keras_mod.add(Flatten())
    keras_mod.add(Dense(10, activation='softmax'))
    # compile the model
    keras_mod.compile(optimizer='adam', loss='categorical_crossentropy',
                      metrics=['acc'])
    # summarize the model
    print(keras_mod.summary())
    # fit the model
    keras_mod.fit(x_train, train_labels,
              batch_size=batch_size,
              epochs=5,
              validation_data=(x_test, test_labels))

    score, acc = keras_mod.evaluate(x_test, test_labels,
                                batch_size=batch_size)
    return (keras_mod, acc)    

In [21]:
keras_user = run_keras_model1(model_user, create_emebed_mat=True)

Build model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 1000, 200)         4747800   
_________________________________________________________________
flatten_1 (Flatten)          (None, 200000)            0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                2000010   
Total params: 6,747,810
Trainable params: 2,000,010
Non-trainable params: 4,747,800
_________________________________________________________________
None
Train on 5308 samples, validate on 2275 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [22]:
keras_w2v = run_keras_model1(model_w2v, 32, create_emebed_mat=True)

Build model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 1000, 200)         8458800   
_________________________________________________________________
flatten_2 (Flatten)          (None, 200000)            0         
_________________________________________________________________
dense_2 (Dense)              (None, 10)                2000010   
Total params: 10,458,810
Trainable params: 2,000,010
Non-trainable params: 8,458,800
_________________________________________________________________
None
Train on 5308 samples, validate on 2275 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [23]:
print('Accuracy of Keras User: {}'.format(keras_user[1]))
print('Accuracy of Keras W2V: {}'.format(keras_w2v[1]))

Accuracy of Keras User: 0.26461538461538464
Accuracy of Keras W2V: 0.24263736263736263


### Task 5: Change the architecture of your model and compare the result

In [28]:
def run_keras_model2(gensim_model, batch_size=batch_size, 
                     create_emebed_mat = False):
    
    # For embedding layer
    if create_emebed_mat:
        embed_matrix = create_embed_matrix(gensim_model)
        e = Embedding(embed_matrix.shape[0], embed_matrix.shape[1],                      weights=[embed_matrix],
                     input_length=x_train.shape[1], trainable=False)
    else:
        e = gensim_model.wv.get_keras_embedding()
        e.input_length = x_train.shape[1]    
    
    # define model
    print('Build model...')
    keras_mod = Sequential()
    # e = Embedding(vocab_size, embed_size, weights=[embed_matrix], 
    #               input_length=maxlen, trainable=False)
    keras_mod.add(e)
    keras_mod.add(Flatten())
    keras_mod.add(Dense(100, activation='relu'))
    Dropout(.5, noise_shape=None, seed=42)    
    keras_mod.add(Dense(100, activation='relu'))
    Dropout(.5, noise_shape=None, seed=42)    
    keras_mod.add(Dense(100, activation='relu'))
    Dropout(.3, noise_shape=None, seed=42)    
    keras_mod.add(Dense(10, activation='softmax'))
    # compile the model
    keras_mod.compile(optimizer='adam', loss='categorical_crossentropy', 
                      metrics=['acc'])
    # summarize the model
    print(keras_mod.summary())
    # fit the model
    keras_mod.fit(x_train, train_labels,
              batch_size=batch_size,
              epochs=10,
              validation_data=(x_test, test_labels))

    score, acc = keras_mod.evaluate(x_test, test_labels,
                                batch_size=batch_size)
    return (keras_mod, acc)    

In [29]:
keras_user_mod2 = run_keras_model2(model_user, create_emebed_mat = True)

Build model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 1000, 200)         4747800   
_________________________________________________________________
flatten_5 (Flatten)          (None, 200000)            0         
_________________________________________________________________
dense_9 (Dense)              (None, 100)               20000100  
_________________________________________________________________
dense_10 (Dense)             (None, 100)               10100     
_________________________________________________________________
dense_11 (Dense)             (None, 100)               10100     
_________________________________________________________________
dense_12 (Dense)             (None, 10)                1010      
Total params: 24,769,110
Trainable params: 20,021,310
Non-trainable params: 4,747,800
_________________________________________

In [30]:
keras_w2v_mod2 = run_keras_model2(model_w2v, create_emebed_mat = True)

Build model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 1000, 200)         8458800   
_________________________________________________________________
flatten_6 (Flatten)          (None, 200000)            0         
_________________________________________________________________
dense_13 (Dense)             (None, 100)               20000100  
_________________________________________________________________
dense_14 (Dense)             (None, 100)               10100     
_________________________________________________________________
dense_15 (Dense)             (None, 100)               10100     
_________________________________________________________________
dense_16 (Dense)             (None, 10)                1010      
Total params: 28,480,110
Trainable params: 20,021,310
Non-trainable params: 8,458,800
_________________________________________

In [31]:
print('Accuracy of Keras Mod 2 User: {}'.format(keras_user_mod2[1]))
print('Accuracy of Keras Mod 2 W2V: {}'.format(keras_w2v_mod2[1]))

Accuracy of Keras Mod 2 User: 0.2681318681318681
Accuracy of Keras Mod 2 W2V: 0.2408791208791209


### Task 6: For each model, do an error evaluation

You now have a bunch of classifiers. For each classifier, pick 2 good classifications and 2 bad classifications. Print the expected and predicted label, and also print the movie synopsis. From these results, can you spot some systematic errors from your models?

In [87]:
classes = np.asarray(mlb2.classes_)

def print_pretty(partition, preds, model_name):
    print('{} Predicted classes: {}'.
           format(model_name, classes[preds[partition].astype(bool)]))
    print('{} Actual classes: {}'.
          format(model_name, classes[test_labels[partition].astype(bool)]))
    print('{} Synopsis:\n{}'.
          format(model_name, ' '.join(X_test[partition])))

def find_mix_max_error(preds):
    num_errs = np.sum(abs(preds - test_labels), axis=1)    
    max_err_indx = np.argpartition(num_errs, -2)[-2:]
    min_err_indx = np.argpartition(num_errs, 2)[:2]
    return (max_err_indx, min_err_indx)

def print_good_and_bad_from_models(model, model_name, x_test):
    pred_prob = model[0].predict(x_test)

    # Define class as top probability plus any that are within 5%
    preds = np.apply_along_axis(lambda x: np.where(x > (max(x) - .05), 1, 0),
                                axis=1, arr=pred_prob)

    # Calculate errors
    max_err_indx, min_err_indx = find_mix_max_error(preds)

    # Print out synopsis and errors
    print('------------------')
    print('--- Max Error ----')
    print('------------------')
    for partition in max_err_indx:
        print('')
        print_pretty(partition, preds, model_name)
        print('')
    

    print('------------------')
    print('--- Min Error ----')
    print('------------------')
    for partition in min_err_indx:
        print('')
        print_pretty(partition, preds, model_name)
        print('')

In [91]:
print_good_and_bad_from_models(keras_user, 'Keras User Mod 1')

------------------
--- Max Error ----
------------------

Keras User Mod 1 Predicted classes: ['Action' 'Comedy' 'Drama' 'Thriller']
Keras User Mod 1 Actual classes: ['Black-and-white' 'Indie']
Keras User Mod 1 Synopsis:
the story begins with the siblings barbra and johnny driving to rural pennsylvania to visit their father grave in the graveyard johnny teases barbra that they re coming to get you barbra and then they are violently attacked by strange man johnny tries to rescue his sister but is killed after he falls and cracks his head on gravestone barbra flees with the zombie in pursuit to farmhouse where to her horror she discovers woman mangled corpse running out of the house she is caught between the house and strange menacing figures that are akin to the zombie in the graveyard man named ben arrives in car and takes her back inside the house ben asks barbra if she lived in the farmhouse but barbra is slowly going into shock hiding in the cellar of the farmhouse are married coupl

In [92]:
print_good_and_bad_from_models(keras_w2v, 'Keras W2V Mod 1')

------------------
--- Max Error ----
------------------

Keras W2V Mod 1 Predicted classes: ['Comedy']
Keras W2V Mod 1 Actual classes: ['Action' 'Crime Fiction' 'Drama' 'Indie' 'Romance Film' 'Thriller']
Keras W2V Mod 1 Synopsis:
comic book store clerk and film buff clarence worley watches sonny chiba triple feature at detroit movie theater for his birthday there he meets alabama whitman seemingly by chance they go to diner for pie and flirt before heading to clarence apartment after having sex she confesses that she is call girl hired by clarence boss as birthday present but she has fallen in love with clarence and he with her the next day they marry alabama pimp drexl spivey makes clarence uneasy an apparition of his idol elvis presley tells him that killing drexl who is also drug dealer will make the world better place clarence tells drexl that he has married alabama and she has no additional business with him drexl and clarence fight and clarence draws gun and kills drexl and henc

In [93]:
print_good_and_bad_from_models(keras_user_mod2, 'Keras User Mod2')

------------------
--- Max Error ----
------------------

Keras User Mod2 Predicted classes: ['Comedy' 'Drama' 'Romance Film']
Keras User Mod2 Actual classes: ['Action' 'Thriller' 'World cinema']
Keras User Mod2 Synopsis:
ryu is deaf mute man working in factory to support his ailing sister who is in desperate need of kidney transplant as ryu is not match and he is laid off from his job ryu contacts black market organ dealer and agrees to exchange his savings and one of his own kidneys in exchange for matching one the dealers perform the operation but disappear after taking ryu kidney and money three weeks later ryu learns from his doctor that donor has been found but ryu is unable to afford the operation now in need of money for the operation and in retaliation for his being fired yeong mi ryu radical anarchist girlfriend conspires they kidnap yu sun the daughter of factory executive dong jin the girl stays with ryu sister who believes ryu is merely babysitting her concurrently ryu and

In [94]:
print_good_and_bad_from_models(keras_w2v_mod2, 'Keras W2V Mod2')

------------------
--- Max Error ----
------------------

Keras W2V Mod2 Predicted classes: ['Action']
Keras W2V Mod2 Actual classes: ['Black-and-white' 'Crime Fiction' 'Drama' 'Indie' 'Thriller']
Keras W2V Mod2 Synopsis:
struggling unemployed young writer takes to following strangers around the streets of london ostensibly to find inspiration for his first novel initially he sets strict rules for himself regarding whom he should follow and for how long but soon discards them as he focuses on well groomed handsome man in dark suit the man in the suit having noticed he is being followed quickly confronts the young man and introduces himself as cobb cobb reveals that he is serial burglar and invites the young man to accompany him on various burglaries the material gains from these crimes seem to be of secondary importance to cobb who takes pleasure in rifling through the personal items in his targets flats and doing things such as drinking their wine he explains that his true passion is 