## Assignment 2 - Movie Classification, the sequel
![](https://images-na.ssl-images-amazon.com/images/S/sgp-catalog-images/region_US/paramount-01376-Full-Image_GalleryBackground-en-US-1484000188762._RI_SX940_.jpg)


#### In this assignment, we will learn a little more about word2vec and then use the resulting vectors to make some predictions.

We will be working with a movie synopsis dataset, found here: http://www.cs.cmu.edu/~ark/personas/

The overall goal should sound a little familiar - based on the movie synopses, we will classify movie genre. Some of your favorites should be in this dataset, and hopefully, based on the genre specific terminology of the movie synopses, we will be able to figure out which movies are which type.

### Task 1: clean your dataset!

For your input data:

1. Find the top 10 movie genres
2. Remove any synopses that don't fit into these genres
3. Take the top 10,000 reviews in terms of "Movie box office revenue"

Congrats, you've got a dataset! For each movie, some of them may have multiple classifications. To deal with this, you'll have to look at the Reuters dataset classification code that we used previously and possibly this example: https://github.com/keras-team/keras/blob/master/examples/reuters_mlp.py

We want to use categorical cross-entropy as our loss function (or a one vs. all classifier in the case of SVM) because our data will potentially have multiple classes!

In [2]:
import os
import ast
import tarfile
import pandas as pd
import numpy as np
from collections import Counter

os.chdir("/home/priya/Downloads/MovieSummaries")


df= pd.read_csv("movie.metadata.tsv", sep="\t", header=0)
df.columns=["Wikipedia movie ID", "Freebase movie ID","Movie name", "Movie release date", 
            "Movie box office revenue", "Movie runtime", "Movie languages", "Movie countries", "Movie genres"]

#Remove null values from Movie box office revenue
df = df[df['Movie box office revenue'].notnull()] 
 
y = df['Movie genres']

y_new =[]
for index,item in enumerate(y):
    m = eval(item)
    y_new.append(list(m.values()))

flat_list = [item for sublist in y_new for item in sublist]
    
    
print(len(flat_list))

flat_list_10 = Counter(flat_list)

top_10_movie_genres = [item[0] for item in flat_list_10.most_common(10)]
print(top_10_movie_genres)
keep_genre = []

for item in y_new:
    genre = list(set(item).intersection(set(top_10_movie_genres)))
    if len(genre)>0:
        keep_genre.append(genre[0])
    else:
        keep_genre.append(np.nan)
        
        
df["Movie genres"] = keep_genre

df = df[df['Movie genres'].notnull()]


42424
['Drama', 'Comedy', 'Romance Film', 'Thriller', 'Action', 'Action/Adventure', 'Crime Fiction', 'Adventure', 'Indie', 'Romantic comedy']


In [3]:

with open("plot_summaries.txt", 'r',encoding='utf-8') as FR:
       plots = FR.readlines()

plots = {x.split('\t')[0]:x.split('\t')[1] for x in plots}

df['plots'] = [plots[str(key)] if str(key) in plots else np.nan for key in df["Wikipedia movie ID"]]

df = df[df['plots'].notnull()]

df.shape

(7195, 10)

In [4]:
for item,values in plots.items():
    print(values)
    break

Shlykov, a hard-working taxi driver and Lyosha, a saxophonist, develop a bizarre love-hate relationship, and despite their prejudices, realize they aren't so different after all.



In [5]:
from nltk import RegexpTokenizer
from nltk.corpus import stopwords
import string


def getsynopses(series):
    stop = stopwords.words('english') + list(string.punctuation)
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(series.lower())
    processed_word_list = [i for i in tokens if i not in stop and len(i)>2]
    return processed_word_list
     
df[['plots']] = df[['plots']].applymap(lambda m:getsynopses(m))
X = list(df['plots'])

In [7]:
import gensim
# let X be a list of tokenized texts (i.e. list of lists of tokens)
model = gensim.models.Word2Vec(X, iter=10, min_count=10, size=200, workers=4)

In [8]:
vocab = dict(zip(model.wv.index2word, model.wv.syn0))
vecArray = []
for words in X:
    w2v_array = []
    for w in words:
        if w in vocab:
            w2v_array.append( vocab[w] )
    vecArray.append(np.mean(w2v_array,axis=0))
vecArray =np.array(vecArray)


  """Entry point for launching an IPython kernel.


In [10]:
vecArray_series = pd.Series(vecArray.tolist())

genre = list(df['Movie genres'])
test= list(df['plots'])
df_data = pd.DataFrame({"X":vecArray_series,"y":genre})
df_data.head()

Unnamed: 0,X,y
0,"[0.27888116240501404, -0.38078123331069946, 0....",Drama
1,"[0.06917796283960342, -0.15979351103305817, 0....",Drama
2,"[0.13169050216674805, 0.05687893182039261, 0.1...",Drama
3,"[-0.03149157762527466, -0.06241234019398689, 0...",Action/Adventure
4,"[-0.04280886799097061, -0.07591108232736588, 0...",Drama


### Task 2: Split the data

Make a dataset of 70% train and 30% test. Sweet.

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_data.X,df_data.y,test_size=0.3, random_state=0)

In [12]:
def getSeries(series):
        val_list = []
        val_list.append(series)
        return val_list
        

y_train = y_train.apply(lambda m:getSeries(m))
print(type(y_train))
y_test = y_test.apply(lambda m:getSeries(m))

<class 'pandas.core.series.Series'>


In [13]:
top_movies_list = [['Drama'],
 ['Comedy'],
 ['Romance Film'],
 ['Thriller'],
 ['Action'],
 ['Action/Adventure'],
 ['Crime Fiction'],
 ['Adventure'],
 ['Indie'],
 ['Romantic comedy']]

### Task 3a: Build a model using ONLY word2vec

Woah what? I don't think that's recommended...

In fact it's a commonly accepted practice. What you will want to do is average the word vectors that will be input for a given synopsis (https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html) and then input that averaged vector as your feature space into a model. For this example, use a Support Vector Machine classifier. For your first time doing this, train a model in Gensim and use the output vectors.


In [14]:
from sklearn.preprocessing import MultiLabelBinarizer

binarizer = MultiLabelBinarizer()
binarizer.fit(top_movies_list)

train_label= binarizer.fit_transform(y_train)
test_label = binarizer.fit_transform(y_test)
y_train = train_label
y_test = test_label

train_list  = []
for val in X_train:
    train_list.append(np.array(val))

final_train_list = np.array(train_list)
print(type(final_train_list))
#print(final_train_list)
test_list  = []
for val in X_test:
    test_list.append(np.array(val))

final_test_list = np.array(test_list)
#print(final_test_list)



<class 'numpy.ndarray'>


In [15]:
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

classifier = OneVsRestClassifier(LinearSVC(random_state=42))
model = classifier.fit(final_train_list, y_train)


predictions = model.predict(final_test_list)

from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

def evaluate(test_labels, predictions):
    precision = precision_score(test_labels, predictions, average='macro')
    recall = recall_score(test_labels, predictions, average='macro')
    accuracy = accuracy_score(test_labels,predictions)

    print("Precision: {:.4f}, Recall: {:.4f}, Accuracy: {:.4f}".format(precision, recall,accuracy))

In [16]:
evaluate(y_test, predictions)

Precision: 0.2972, Recall: 0.1186, Accuracy: 0.4553


  'precision', 'predicted', average, warn_for)


### Task 3b: Do the same thing but with pretrained embeddings

Now pull down the Google News word embeddings and do the same thing. Compare the results. Why was one better than the other?

In [21]:
from gensim import models
model2 = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
#model2 = models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

In [22]:
import os
import ast
import tarfile
import pandas as pd
import numpy as np
from collections import Counter

df= pd.read_csv("movie.metadata.tsv", sep="\t", header=0)
df.columns=["Wikipedia movie ID", "Freebase movie ID","Movie name", "Movie release date", 
            "Movie box office revenue", "Movie runtime", "Movie languages", "Movie countries", "Movie genres"]

#Remove null values from Movie box office revenue
df = df[df['Movie box office revenue'].notnull()]
y = df['Movie genres']

y_new =[]
for index,item in enumerate(y):
    m = eval(item)
    y_new.append(list(m.values()))

flat_list = [item for sublist in y_new for item in sublist]
    
    
print(len(flat_list))
flat_list_10 = Counter(flat_list)

top_10_movie_genres = [item[0] for item in flat_list_10.most_common(10)]
print(top_10_movie_genres)
keep_genre = []

for item in y_new:
    genre = list(set(item).intersection(set(top_10_movie_genres)))
    if len(genre)>0:
        keep_genre.append(genre[0])
    else:
        keep_genre.append(np.nan)
        
        
df["Movie genres"] = keep_genre

df = df[df['Movie genres'].notnull()]

with open("plot_summaries.txt", 'r',encoding='utf-8') as FR:
       plots = FR.readlines()

plots = {x.split('\t')[0]:x.split('\t')[1] for x in plots}

df['plots'] = [plots[str(key)] if str(key) in plots else np.nan for key in df["Wikipedia movie ID"]]
df = df[df['plots'].notnull()]
df.shape


42424
['Drama', 'Comedy', 'Romance Film', 'Thriller', 'Action', 'Action/Adventure', 'Crime Fiction', 'Adventure', 'Indie', 'Romantic comedy']


(7195, 10)

In [23]:
from nltk import RegexpTokenizer
from nltk.corpus import stopwords
import string


def getsynopses(series):
    stop = stopwords.words('english') + list(string.punctuation)
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(series.lower())
    processed_word_list = [i for i in tokens if i not in stop and len(i)>2]
    return processed_word_list
     


df[['plots']] = df[['plots']].applymap(lambda m:getsynopses(m))


In [24]:
X = list(df['plots'])
vocab2_words = model2.vocab.keys()
vocab2 = {}
for word in vocab2_words:
    vocab2[word] = model2[word]
vecArray = []
for words in X:
    w2v_array = []
    for w in words:
        if w in vocab2:
            w2v_array.append( vocab2[w] )
    vecArray.append(np.mean(w2v_array,axis=0))
vecArray =np.array(vecArray)


In [26]:
vecArray_series = pd.Series(vecArray.tolist())
genre = list(df['Movie genres'])
df_data = pd.DataFrame({"X":vecArray_series,"y":genre})


In [27]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_data.X,df_data.y,
                                                    test_size=0.3,stratify=df_data.y, random_state=0)

In [28]:
def getSeries(series):
        val_list = []
        val_list.append(series)
        return val_list
        
              
y_train = y_train.apply(lambda m:getSeries(m))
y_test = y_test.apply(lambda m:getSeries(m))

In [29]:
top_movies_list = [['Drama'],
 ['Comedy'],
 ['Romance Film'],
 ['Thriller'],
 ['Action'],
 ['Action/Adventure'],
 ['Crime Fiction'],
 ['Adventure'],
 ['Indie'],
 ['Romantic comedy']]

In [30]:
from sklearn.preprocessing import MultiLabelBinarizer

binarizer = MultiLabelBinarizer()
binarizer.fit(top_movies_list)


train_label = binarizer.fit_transform(y_train)
test_label = binarizer.fit_transform(y_test)

y_train = train_label
y_test = test_label

train_list  = []

for val in X_train:    
    train_list.append(np.array(val))

final_train_list = np.array(train_list)

test_list  = []
for val in X_test:
    test_list.append(np.array(val))

final_test_list = np.array(test_list)




In [32]:
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

classifier = OneVsRestClassifier(LinearSVC(random_state=42))
model = classifier.fit(final_train_list, y_train)


predictions = model.predict(final_test_list)

from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

def evaluate(test_labels, predictions):
    precision = precision_score(test_labels, predictions, average='macro')
    recall = recall_score(test_labels, predictions, average='macro')
    accuracy = accuracy_score(test_labels,predictions)

    print("Precision: {:.4f}, Recall: {:.4f}, Accuracy: {:.4f}".format(precision, recall,accuracy))

In [33]:
evaluate(y_test, predictions)

Precision: 0.2946, Recall: 0.1053, Accuracy: 0.4423


  'precision', 'predicted', average, warn_for)


### Task 3: Build a neural net model using word2vec embeddings (both pretrained and within an Embedding layer from Keras)

In [37]:
from keras.models import Sequential
from keras.layers import Dense, Activation

model_textual = Sequential([
    Dense(300, input_shape=(5036,1)),
    Activation('relu'),
    Dense(24),
    Activation('softmax'),
])

model_textual.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model_textual.fit(X_train, y_train, epochs=10, batch_size=500)

In [81]:
score1 = model_textual.evaluate(X_test, Y_test, batch_size=249)



In [82]:
print("%s: %.2f%%" % (model_textual.metrics_names[1], score1[1]*100))

acc: 73.34%


### Task 4: Change the architecture of your model and compare the result

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

model = Sequential()
model.add(Dense(300, input_shape=(300,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(24))
model.add(Activation('softmax'))

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [67]:
model.fit(X_train, Y_train, epochs=10000, batch_size=500,verbose=0)

<keras.callbacks.History at 0x27297eec710>

In [68]:
score = model_textual.evaluate(X_test, Y_test, batch_size=249)



### Task 5: For each model, do an error evaluation

You now have a bunch of classifiers. For each classifier, pick 2 good classifications and 2 bad classifications. Print the expected and predicted label, and also print the movie synopsis. From these results, can you spot some systematic errors from your models?

In [69]:
print("%s: %.2f%%" % (model_textual.metrics_names[1], score[1]*100))

acc: 74.23%


In [75]:
Y_preds=model_textual.predict(X_test)