In [1]:
%autosave 0

Autosave disabled


## Assignment 2 - Movie Classification, the sequel
![](https://images-na.ssl-images-amazon.com/images/S/sgp-catalog-images/region_US/paramount-01376-Full-Image_GalleryBackground-en-US-1484000188762._RI_SX940_.jpg)


#### In this assignment, we will learn a little more about word2vec and then use the resulting vectors to make some predictions.

We will be working with a movie synopsis dataset, found here: http://www.cs.cmu.edu/~ark/personas/

The overall goal should sound a little familiar: based on the movie synopses, we will classify movie genre. Some of your favorites should be in this dataset, and hopefully the genre-specific terminology in the synopses will let us figure out which movies belong to which genre.

### Task 1: clean your dataset!

For your input data:

1. Find the top 10 movie genres
2. Remove any synopses that don't fit into these genres
3. Take the top 10,000 movies in terms of "Movie box office revenue"
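The three steps above can be sketched on toy data as follows. This is a minimal illustration and not the notebook's actual pipeline: the record layout, the `top_genres` and `filter_and_rank` helpers, and the revenue figures are all hypothetical, and every record is assumed to have a revenue value.

```python
from collections import Counter

# Hypothetical toy records standing in for the real metadata table.
movies = [
    {"name": "A", "revenue": 300, "genres": ["Drama", "Comedy"]},
    {"name": "B", "revenue": 500, "genres": ["Drama"]},
    {"name": "C", "revenue": 900, "genres": ["Horror"]},
    {"name": "D", "revenue": 100, "genres": []},
]

def top_genres(records, n):
    """Step 1: return the n most frequent genre names across all records."""
    counts = Counter(g for rec in records for g in rec["genres"])
    return [genre for genre, _ in counts.most_common(n)]

def filter_and_rank(records, top, k):
    """Steps 2 and 3: keep movies with at least one top genre
    (trimmed to those genres), then take the k highest grossing."""
    top = set(top)
    kept = []
    for rec in records:
        overlap = [g for g in rec["genres"] if g in top]
        if overlap:
            kept.append({**rec, "genres": overlap})
    return sorted(kept, key=lambda r: r["revenue"], reverse=True)[:k]

top = top_genres(movies, n=1)                # n=10 in the assignment
dataset = filter_and_rank(movies, top, k=2)  # k=10000 in the assignment
print(top, [m["name"] for m in dataset])
```

Movie C has the highest revenue but is dropped because its only genre is outside the top list, which is the behavior step 2 asks for.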

Congrats, you've got a dataset! Some movies may have multiple classifications. To deal with this, you'll have to look at the Reuters dataset classification code that we used previously and possibly this example: https://github.com/keras-team/keras/blob/master/examples/reuters_mlp.py

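We want to use categorical cross-entropy as our loss function (or a one-vs-rest classifier in the case of SVM) because a movie can potentially carry several genre labels at once. Keep in mind that categorical cross-entropy with a softmax output assumes one label per example; when genres are not mutually exclusive, a sigmoid output with binary cross-entropy per genre is the usual alternative.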
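To make the multi-label setup concrete, here is a small sketch of how a set of genres per movie can be turned into a multi-hot target row. The `multi_hot` helper and the three-genre class list are hypothetical; scikit-learn's `MultiLabelBinarizer` produces the same kind of matrix.

```python
import numpy as np

def multi_hot(label_sets, classes):
    """Encode each set of labels as a 0/1 row with one column per class."""
    index = {c: i for i, c in enumerate(classes)}
    out = np.zeros((len(label_sets), len(classes)), dtype=int)
    for row, labels in enumerate(label_sets):
        for label in labels:
            out[row, index[label]] = 1
    return out

# Hypothetical class list; the assignment would use the top 10 genres.
classes = ["Comedy", "Drama", "Horror"]
y = multi_hot([{"Drama", "Comedy"}, {"Horror"}], classes)
print(y)
```

A row may contain more than one 1, which is what distinguishes this target from a single-label one-hot encoding.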

In [2]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.pipeline import Pipeline
from sklearn import metrics
import gensim
import word2vec

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding

Using TensorFlow backend.


In [3]:
dataPath = '../data/MovieSummaries/'
movieHeader = ['wikiID', 'freebaseID', 'name', 'releaseDate', 'revenue',
               'runtime', 'languages', 'countries', 'genres']
movieDat = pd.read_csv(dataPath + 'movie.metadata.tsv', delimiter = '\t',
                      header = None, names = movieHeader)
synopsisDat = pd.read_csv(dataPath + 'plot_summaries.txt', delimiter = '\t',
                      header = None, names = ['wikiID', 'synopsis'])

In [4]:
# To find the top genres, split the genres into their own columns
# (one movie can have multiple genres) and then sum those columns
# to find the 10 largest

### Step 1 -- Convert Genres into a list of genres
def cleanGenres(genreDat):
    clean = [re.findall(r'"\S+": "(.+)"', x) for x in genreDat.split(',')]
    return [item for sublist in clean for item in sublist]
    
movieDat['genres_clean'] = movieDat.genres.apply(cleanGenres)

### Step 2 --- "One Hot Encode" that list and then join back
mlb = MultiLabelBinarizer()
movieDat = movieDat.join(pd.DataFrame(mlb.fit_transform(movieDat['genres_clean']),
                          columns=mlb.classes_,
                          index=movieDat.index))

### Step 3 --- Find genres with the largest sums
idCols = movieHeader.copy()
idCols.extend(['genres_clean'])
genreCols = movieDat.columns.difference(idCols)
topGenres = movieDat.loc[:, genreCols].sum().nlargest(10).index

In [5]:
# Now filter the dataset to movies with at least one top-10 genre and take the 10,000 highest grossing
movieDat['topGenresOnly'] = movieDat['genres_clean'].apply(lambda x: set(x).intersection(topGenres))
# Test the filtered genre sets: an empty set means the movie has none of the top genres
containsGenreBool = movieDat['topGenresOnly'].apply(bool)
movies = movieDat.loc[containsGenreBool].sort_values('revenue', ascending=False)
finalDat = movies.merge(synopsisDat, how = 'inner', on='wikiID').iloc[:10000]

# Split into X and y and preprocess
X = np.array(finalDat['synopsis'].apply(gensim.utils.simple_preprocess))
y = np.array(finalDat['topGenresOnly'])

### Task 2: Split the data

Make a dataset of 70% train and 30% test. Sweet.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state=10)

In [7]:
mlb2 = MultiLabelBinarizer()
train_labels = mlb2.fit_transform(y_train)
test_labels = mlb2.transform(y_test)

### Task 3a: Build a model using ONLY word2vec

Woah what? I don't think that's recommended...

In fact it's a commonly accepted practice. What you will want to do is average the word vectors of all the words in a given synopsis (https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html) and then use that averaged vector as the feature vector for a model. For this example, use a Support Vector Machine classifier. For your first time doing this, train a model in Gensim and use the output vectors.
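As a quick illustration of the averaging idea, the sketch below uses a hypothetical two-word, three-dimensional embedding table (`toy_w2v`) rather than real word2vec output. Words missing from the table are skipped, and a synopsis with no known words falls back to a zero vector.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, for illustration only.
toy_w2v = {
    "space": np.array([1.0, 0.0, 2.0]),
    "ship": np.array([3.0, 2.0, 0.0]),
}

def average_vector(tokens, w2v, dim):
    """Average the vectors of the known tokens; zeros if none are known."""
    vecs = [w2v[t] for t in tokens if t in w2v]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

doc_vec = average_vector(["space", "ship", "unknownword"], toy_w2v, dim=3)
empty_vec = average_vector(["unknownword"], toy_w2v, dim=3)
print(doc_vec)    # element-wise mean of the "space" and "ship" vectors
print(empty_vec)  # zero vector, since no token is in the table
```

Every synopsis ends up as a fixed-length vector regardless of how many words it contains, which is what lets a classifier such as an SVM consume it directly.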


In [8]:
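# Passing the corpus to the constructor builds the vocabulary and runs an initial training pass;
# the explicit train() call then continues training for 10 more epochs.
# Note: gensim >= 4.0 renamed `size` to `vector_size`, `index2word` to `index_to_key`, and `syn0` to `vectors`.
model = gensim.models.Word2Vec(X_train, size=100, window=5, min_count=5, workers=10)
model.train(X_train, total_examples=len(X_train), epochs=10)
w2v = {w: vec for w, vec in zip(model.wv.index2word, model.wv.syn0)}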

In [9]:
# Convert X_train and X_test (lists of tokens) to a numeric matrix by replacing
# each synopsis with the mean of its word vectors (zeros if no word is in the vocabulary)
def convert_word_mat_to_num(word_mat, w2v):
    dim = len(next(iter(w2v.values())))
    return np.array([np.mean([w2v[w] for w in words if w in w2v]
                             or [np.zeros(dim)], axis=0)
                     for words in word_mat])

In [None]:
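def runMod(mod, embedding):
    # Fit a one-vs-rest classifier on averaged word vectors and score it on the test set
    clf = OneVsRestClassifier(mod(random_state=10))
    train_x = convert_word_mat_to_num(X_train, embedding)
    clf.fit(train_x, train_labels)
    
    test_x = convert_word_mat_to_num(X_test, embedding)
    preds = clf.predict(test_x)
    # With multi-label indicator arrays, accuracy_score is subset accuracy:
    # a prediction counts as correct only if every genre label matches
    acc = metrics.accuracy_score(test_labels, preds)
    return (clf, acc)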

In [None]:
svc_w2v = runMod(SVC, w2v)

### Task 3b: Do the same thing but with pretrained embeddings

Now pull down the Google News word embeddings and do the same thing. Compare the results. Why was one better than the other?
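One diagnostic that may help with the comparison is vocabulary coverage: the fraction of synopsis tokens that each embedding actually has a vector for. The sketch below is only one possible angle on the question, and the `coverage` helper and toy inputs are hypothetical.

```python
def coverage(docs, vocab):
    """Fraction of all tokens in docs that appear in vocab."""
    total = sum(len(doc) for doc in docs)
    if total == 0:
        return 0.0
    known = sum(1 for doc in docs for tok in doc if tok in vocab)
    return known / total

# Hypothetical tokenized synopses and embedding vocabulary.
toy_docs = [["alien", "attack"], ["robot", "zzyzx"]]
toy_vocab = {"alien", "attack", "robot"}
cov = coverage(toy_docs, toy_vocab)
print(cov)
```

With the notebook's variables, this could be called as `coverage(X_test, w2v)` and `coverage(X_test, w2v_goog)`, since membership tests work on dictionaries as well as sets.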

In [None]:
# load_word2vec_format returns a KeyedVectors object, so no `.wv` indirection is needed
model_goog = gensim.models.KeyedVectors.load_word2vec_format('../data/GoogleNews-vectors-negative300.bin.gz', binary=True)
w2v_goog = {w: vec for w, vec in zip(model_goog.index2word, model_goog.syn0)}

In [91]:
# Use a separate name so the Task 3a result is not overwritten and the two can be compared
svc_goog = runMod(SVC, w2v_goog)

### Task 4: Build a neural net model using word2vec embeddings (both pretrained and within an Embedding layer from Keras)
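For the pretrained variant, one common preparatory step is to build an embedding matrix whose row `i` holds the vector of the word with index `i`. The sketch below assumes a word-to-index mapping that starts at 1 (as the Keras `Tokenizer`'s `word_index` does), leaving row 0 for padding. The `build_embedding_matrix` helper and the toy inputs are hypothetical.

```python
import numpy as np

def build_embedding_matrix(word_index, w2v, dim):
    """Row i holds the vector for the word with index i.
    Row 0 (padding) and out-of-vocabulary words stay all zeros."""
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, idx in word_index.items():
        vec = w2v.get(word)
        if vec is not None:
            matrix[idx] = vec
    return matrix

# Hypothetical inputs: a tiny word index and 3-dimensional vectors.
toy_index = {"space": 1, "ship": 2, "zzyzx": 3}
toy_w2v = {
    "space": np.array([1.0, 0.0, 2.0]),
    "ship": np.array([3.0, 2.0, 0.0]),
}
emb = build_embedding_matrix(toy_index, toy_w2v, dim=3)
print(emb.shape)
```

In Keras, a matrix like this is typically supplied to the `Embedding` layer as its initial weights, with the layer marked non-trainable to keep the pretrained vectors fixed. The alternative named in the task title is to leave the layer randomly initialized and let it learn the embeddings during training.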

### Task 5: Change the architecture of your model and compare the result

### Task 6: For each model, do an error evaluation

You now have a bunch of classifiers. For each classifier, pick 2 good classifications and 2 bad classifications. Print the expected and predicted label, and also print the movie synopsis. From these results, can you spot some systematic errors from your models?
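A minimal sketch of the selection step, assuming the true and predicted labels are multi-hot arrays of the same shape. Here a "good" classification is taken to mean an exact match on every genre, which is only one possible definition. The `pick_examples` and `decode` helpers and the toy arrays are hypothetical.

```python
import numpy as np

def pick_examples(y_true, y_pred, n=2):
    """Return indices of the first n exact matches and first n mismatches."""
    exact = np.all(y_true == y_pred, axis=1)
    good = np.flatnonzero(exact)[:n]
    bad = np.flatnonzero(~exact)[:n]
    return good, bad

def decode(row, classes):
    """Turn a multi-hot row back into a list of genre names."""
    return [c for c, flag in zip(classes, row) if flag]

# Hypothetical labels for four movies over three genres.
classes = ["Comedy", "Drama", "Horror"]
y_true = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [1, 0, 0]])

good, bad = pick_examples(y_true, y_pred)
for i in list(good) + list(bad):
    print(i, "expected:", decode(y_true[i], classes),
          "predicted:", decode(y_pred[i], classes))
```

In the notebook, the same indices could be used to look up the matching test synopsis so it can be printed alongside the labels.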