## Assignment 2 - Movie Classification, the sequel
![](https://images-na.ssl-images-amazon.com/images/S/sgp-catalog-images/region_US/paramount-01376-Full-Image_GalleryBackground-en-US-1484000188762._RI_SX940_.jpg)


#### In this assignment, we will learn a little more about word2vec and then use the resulting vectors to make some predictions.

We will be working with a movie synopsis dataset, found here: http://www.cs.cmu.edu/~ark/personas/

The overall goal should sound a little familiar - based on the movie synopses, we will classify movie genre. Some of your favorites should be in this dataset, and hopefully, based on the genre specific terminology of the movie synopses, we will be able to figure out which movies are which type.

### Task 1: clean your dataset!

For your input data:

1. Find the top 10 movie genres
2. Remove any synopses that don't fit into these genres
3. Take the top 10,000 reviews in terms of "Movie box office revenue"

Congrats, you've got a dataset! For each movie, some of them may have multiple classifications. To deal with this, you'll have to look at the Reuters dataset classification code that we used previously and possibly this example: https://github.com/keras-team/keras/blob/master/examples/reuters_mlp.py

We want to use categorical cross-entropy as our loss function (or a one vs. all classifier in the case of SVM) because our data will potentially have multiple classes!

In [6]:
import os
import ast
import tarfile
import pandas as pd
import numpy as np
from collections import Counter

os.chdir(r"C:\Users\Deepak\Desktop\II SEMESTER\NLP\MovieSummaries")


movie = pd.read_csv("movie.metadata.tsv",sep="\t",header=None)



movie_header = ["wikipedia_movie_id", "freebase_movie_id", "movie_name",
               "movie_release_date", "movie_box_office_revenue",
               "movie_runtime", "movie_languages", "movie_countries",
               "movie_genres"]
movie.columns = movie_header
movie.head()

# Remove the NaN on the basis of Movie box office revenue
movie = movie[movie['movie_box_office_revenue'].notnull()]

def getVal(series):
        aa_dict = ast.literal_eval(series)
        val_list = []
        for val in aa_dict.values():
            val_list.append(val)
        return val_list
        
              
movie[["movie_languages"]] = movie[["movie_languages"]].applymap(lambda m:getVal(m))
movie[[ "movie_countries"]] = movie[[ "movie_countries"]].applymap(lambda m:getVal(m))
movie[["movie_genres"]] = movie[['movie_genres']].applymap(lambda m:getVal(m))



all_genre = list(movie['movie_genres'])

all_genre_flat = [item for sublist in all_genre for item in sublist]

movie_genre_count = Counter(all_genre_flat)        

top_10_movie_genres = [item[0] for item in movie_genre_count.most_common(10)]

keep_genre = []

for item in all_genre:
    genre = list(set(item).intersection(set(top_10_movie_genres)))
    if len(genre)>0:
        keep_genre.append(genre[0])
    else:
        keep_genre.append(np.nan)
        
movie['movie_genres'] = keep_genre

movie = movie[movie['movie_genres'].notnull()]

with open("plot_summaries.txt", 'r',encoding='utf-8') as FR:
       synopses = FR.readlines()

synopses = {x.split('\t')[0]:x.split('\t')[1] for x in synopses}

movie['synopses'] = [synopses[str(key)] if str(key) in synopses else np.nan for key in movie["wikipedia_movie_id"]]

movie = movie[movie['synopses'].notnull()]

In [7]:
for item,values in synopses.items():
    print(values)
    break

Shlykov, a hard-working taxi driver and Lyosha, a saxophonist, develop a bizarre love-hate relationship, and despite their prejudices, realize they aren't so different after all.



In [3]:
#df = movie.iloc[0:10,:]

from nltk import RegexpTokenizer
from nltk.corpus import stopwords
import string


def getsynopses(series):
    stop = stopwords.words('english') + list(string.punctuation)
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(series.lower())
    processed_word_list = [i for i in tokens if i not in stop and len(i)>2]
    return processed_word_list
     
#movie[['synopses']].applymap(lambda m:getsynopses(m))

movie[['synopses']] = movie[['synopses']].applymap(lambda m:getsynopses(m))

In [4]:
X = list(movie['synopses'])

In [5]:
import gensim
# let X be a list of tokenized texts (i.e. list of lists of tokens)
model = gensim.models.Word2Vec(X, iter=10, min_count=10, size=200, workers=4)
w2v = dict(zip(model.wv.index2word, model.wv.syn0))



In [6]:
vecArray = np.array([np.mean([w2v[w] for w in words if w in w2v] or [np.zeros(len(w2v))], axis=0) for words in X])

In [131]:
vecArray_series = pd.Series(vecArray.tolist())

genre = list(movie['movie_genres'])

pandas.core.series.Series

In [132]:
df = pd.DataFrame({"X":test,"y":genre})

### Task 2: Split the data

Make a dataset of 70% train and 30% test. Sweet.

In [143]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.X,df.y,test_size=0.3,stratify=df.y, random_state=0)

In [144]:
def getSeries(series):
        val_list = []
        val_list.append(series)
        return val_list
        
              
y_train = y_train.apply(lambda m:getSeries(m))
y_test = y_test.apply(lambda m:getSeries(m))

In [147]:
top_movies_list = [['Drama'],
 ['Comedy'],
 ['Romance Film'],
 ['Thriller'],
 ['Action'],
 ['Action/Adventure'],
 ['Crime Fiction'],
 ['Adventure'],
 ['Indie'],
 ['Romantic comedy']]

### Task 3a: Build a model using ONLY word2vec

Woah what? I don't think that's recommended...

In fact it's a commonly accepted practice. What you will want to do is average the word vectors that will be input for a given synopsis (https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html) and then input that averaged vector as your feature space into a model. For this example, use a Support Vector Machine classifier. For your first time doing this, train a model in Gensim and use the output vectors.


In [148]:
from sklearn.preprocessing import MultiLabelBinarizer

binarizer = MultiLabelBinarizer()
binarizer.fit(top_movies_list)


train_label = binarizer.fit_transform(y_train)
test_label = binarizer.fit_transform(y_test)

y_train = train_label
y_test = test_label

train_list  = []
for val in X_train:
    test.append(np.array(val))

final_train_list = np.array(train_list)

test_list  = []
for val in X_test:
    test3.append(np.array(val))

final_test_list = np.array(test_list)

from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

classifier = OneVsRestClassifier(LinearSVC(random_state=42))
model = classifier.fit(final_train_list, y_train)


predictions = model.predict(final_test_list)

from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

def evaluate(test_labels, predictions):
    precision = precision_score(test_labels, predictions, average='macro')
    recall = recall_score(test_labels, predictions, average='macro')
    accuracy = accuracy_score(test_labels,predictions)

    print("Precision: {:.4f}, Recall: {:.4f}, Accuracy: {:.4f}".format(precision, recall,accuracy))

MultiLabelBinarizer(classes=None, sparse_output=False)

In [207]:
evaluate(y_test, predictions)

Precision: 0.3870, Recall: 0.0862, Accuracy: 0.1163


  'precision', 'predicted', average, warn_for)


### Task 3b: Do the same thing but with pretrained embeddings

Now pull down the Google News word embeddings and do the same thing. Compare the results. Why was one better than the other?

In [2]:
from gensim import models

model2 = models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)



In [92]:
import os
import ast
import tarfile
import pandas as pd
import numpy as np
from collections import Counter

movie = pd.read_csv("movie.metadata.tsv",sep="\t",header=None)



movie_header = ["wikipedia_movie_id", "freebase_movie_id", "movie_name",
               "movie_release_date", "movie_box_office_revenue",
               "movie_runtime", "movie_languages", "movie_countries",
               "movie_genres"]
movie.columns = movie_header
movie.head()

# Remove the NaN on the basis of Movie box office revenue
movie = movie[movie['movie_box_office_revenue'].notnull()]

def getVal(series):
        aa_dict = ast.literal_eval(series)
        val_list = []
        for val in aa_dict.values():
            val_list.append(val)
        return val_list
        
              
movie[["movie_languages"]] = movie[["movie_languages"]].applymap(lambda m:getVal(m))
movie[[ "movie_countries"]] = movie[[ "movie_countries"]].applymap(lambda m:getVal(m))
movie[["movie_genres"]] = movie[['movie_genres']].applymap(lambda m:getVal(m))



all_genre = list(movie['movie_genres'])

all_genre_flat = [item for sublist in all_genre for item in sublist]

movie_genre_count = Counter(all_genre_flat)        

top_10_movie_genres = [item[0] for item in movie_genre_count.most_common(10)]

keep_genre = []

for item in all_genre:
    genre = list(set(item).intersection(set(top_10_movie_genres)))
    if len(genre)>0:
        keep_genre.append(genre[0])
    else:
        keep_genre.append(np.nan)
        
movie['movie_genres'] = keep_genre

movie = movie[movie['movie_genres'].notnull()]

with open("plot_summaries.txt", 'r',encoding='utf-8') as FR:
       synopses = FR.readlines()

synopses = {x.split('\t')[0]:x.split('\t')[1] for x in synopses}

movie['synopses'] = [synopses[str(key)] if str(key) in synopses else np.nan for key in movie["wikipedia_movie_id"]]

movie = movie[movie['synopses'].notnull()]

In [18]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import string 

tokenizer = RegexpTokenizer(r'\w+')

# create English stop words list
en_stop = stopwords.words('english') + list(string.punctuation)

movie_mean_wordvec=np.zeros((len(movie),300))
movie_mean_wordvec.shape

movie_genres_list = list(movie['movie_genres'])
synopses_list = list(movie['synopses'])

In [37]:
# generate the movie mean wordvec

genres=[]
rows_to_delete=[]
for i in range(len(movie)):
    movie_genres=movie_genres_list[i]
    genres.append(movie_genres)
    overview=synopses_list[i]
    tokens = tokenizer.tokenize(overview)
    stopped_tokens = [k for k in tokens if not k in en_stop]
    count_in_vocab=0
    s=0
    if len(stopped_tokens)==0:
        rows_to_delete.append(i)
        genres.pop(-1)
    else:
        for tok in stopped_tokens:
            if tok.lower() in model2.vocab:
                count_in_vocab+=1
                s+=model2[tok.lower()]
        if count_in_vocab!=0:
            movie_mean_wordvec[i]=s/float(count_in_vocab)
        else:
            rows_to_delete.append(i)
            genres.pop(-1)

In [40]:
# prepare the data for model

from sklearn.preprocessing import MultiLabelBinarizer

mask2=[]
for row in range(len(movie_mean_wordvec)):
    if row in rows_to_delete:
        mask2.append(False)
    else:
        mask2.append(True)

X=movie_mean_wordvec[mask2]


mlb=MultiLabelBinarizer()
Y=mlb.fit_transform(genres)

mask_text=np.random.rand(len(X))<0.8

X_train=X[mask_text]
Y_train=Y[mask_text]
X_test=X[~mask_text]
Y_test=Y[~mask_text]

### Task 3: Build a neural net model using word2vec embeddings (both pretrained and within an Embedding layer from Keras)

In [65]:
# build a model

from keras.models import Sequential
from keras.layers import Dense, Activation

model_textual = Sequential([
    Dense(300, input_shape=(300,)),
    Activation('relu'),
    Dense(24),
    Activation('softmax'),
])

model_textual.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [73]:
model_textual.fit(X_train, Y_train, epochs=10, batch_size=500)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x272982196a0>

In [81]:
score1 = model_textual.evaluate(X_test, Y_test, batch_size=249)



In [82]:
print("%s: %.2f%%" % (model_textual.metrics_names[1], score1[1]*100))

acc: 73.34%


### Task 4: Change the architecture of your model and compare the result

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

model = Sequential()
model.add(Dense(300, input_shape=(300,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(24))
model.add(Activation('softmax'))

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [67]:
model.fit(X_train, Y_train, epochs=10000, batch_size=500,verbose=0)

<keras.callbacks.History at 0x27297eec710>

In [68]:
score = model_textual.evaluate(X_test, Y_test, batch_size=249)



### Task 5: For each model, do an error evaluation

You now have a bunch of classifiers. For each classifier, pick 2 good classifications and 2 bad classifications. Print the expected and predicted label, and also print the movie synopsis. From these results, can you spot some systematic errors from your models?

In [69]:
print("%s: %.2f%%" % (model_textual.metrics_names[1], score[1]*100))

acc: 74.23%


In [75]:
Y_preds=model_textual.predict(X_test)