In [89]:
from __future__ import print_function

import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import SGD

import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')

import numpy as np

First we load the data. This analysis uses two datasets: one contains only the RGB color features, and the other contains all of the quantitative metadata we scraped.

In [98]:
# RGB color features and labels for the color-only MLP
x_train = np.genfromtxt('train1_RGB.csv', delimiter=',', skip_header = 1)
y_train = np.genfromtxt('train1_y.csv', delimiter=',', skip_header = 1)

# Full metadata features (trained against y_train in model_full)
x_train_full = np.genfromtxt('train1_x.csv', delimiter=',', skip_header = 1)

# Test set used to evaluate the color-only model
x_test = np.genfromtxt('test1_RGB.csv', delimiter=',', skip_header = 1)
y_test = np.genfromtxt('test1_y.csv', delimiter=',', skip_header = 1)

# Test set used to evaluate the full-metadata model
x_test2 = np.genfromtxt('test2_x.csv', delimiter=',', skip_header = 1)
y_test2 = np.genfromtxt('test2_y.csv', delimiter=',', skip_header = 1)


In **model** we use only the color data (RGB means and standard deviations for the posters). We used the Sequential model in Keras to build an MLP for multi-class softmax classification. We chose the "sigmoid" activation over "relu" because it gave better results in every configuration we tried. We tuned the learning rate by hand, watching how training progressed across the epochs. After trying 1e-5, 1e-4, 1e-3, .01, and .1, we settled on .01. That choice was based on behavior on the training data; when evaluated on the test data, the overall accuracy was roughly the same for every rate. We also tried momentum values from .5 to .99. The results were often roughly the same, but values other than .9 produced more volatile accuracies, so we kept .9 as a typical, reliable value.
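To make the roles of the learning rate and momentum more concrete, here is a minimal NumPy sketch of SGD with Nesterov momentum, run over the same learning rates on a toy quadratic loss. It is not our Keras model: `nesterov_sgd`, `toy_loss`, and `toy_grad` are hypothetical helpers written for illustration, the update rule is the commonly used Nesterov form, and the small learning-rate decay we pass to Keras is ignored.

```python
import numpy as np

def nesterov_sgd(grad_fn, w0, lr, momentum=0.9, steps=100):
    """Run SGD with Nesterov momentum for a fixed number of steps."""
    w = np.asarray(w0, dtype=float).copy()
    velocity = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        velocity = momentum * velocity - lr * g   # accumulate past gradients
        w = w + momentum * velocity - lr * g      # Nesterov look-ahead step
    return w

def toy_loss(w):
    # Hypothetical stand-in objective: 0.5 * ||w||^2
    return 0.5 * float(np.sum(w ** 2))

def toy_grad(w):
    # Gradient of the toy loss above
    return w

# Sweep the same learning rates we tried by hand
final_losses = {}
for lr in (1e-5, 1e-4, 1e-3, 0.01, 0.1):
    w_final = nesterov_sgd(toy_grad, w0=[1.0, -2.0], lr=lr)
    final_losses[lr] = toy_loss(w_final)
    print(lr, final_losses[lr])
```

On this toy problem the smallest rates barely move the weights in 100 steps, which matches the general intuition for why very small rates looked worse during training. The toy loss says nothing about which rate is best for the real model.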

Overall, the accuracy for the color vectors is not very good: on the test set it comes out to about 20%. This is not surprising, because our color analysis is extremely elementary. For each poster we scraped the mean value of each of the three RGB channels and recorded each channel's standard deviation. You can imagine many scenarios where this kind of analysis fails to pick up differences between genres or the character of the movie, but it may have some predictive power when added to a larger deep learning model for the posters themselves.
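As an illustration of how coarse these six features are, the sketch below computes them for a small image array. The function name `color_features`, the feature order (three means, then three standard deviations), and the 2 x 2 `poster` array are assumptions made for this example; the column order in our CSV files may differ.

```python
import numpy as np

def color_features(image):
    """Return [mean_R, mean_G, mean_B, sd_R, sd_G, sd_B] for an H x W x 3 array."""
    pixels = np.asarray(image, dtype=float).reshape(-1, 3)  # one row per pixel
    means = pixels.mean(axis=0)  # per-channel mean
    sds = pixels.std(axis=0)     # per-channel standard deviation
    return np.concatenate([means, sds])

# Hypothetical 2 x 2 "poster": two pure-red pixels and two pure-blue pixels
poster = np.array([[[255, 0, 0], [255, 0, 0]],
                   [[0, 0, 255], [0, 0, 255]]])
features = color_features(poster)
print(features)
```

Because the features are computed over all pixels at once, any rearrangement of the same pixels gives exactly the same six numbers. This is one way two visually very different posters can look identical to the model.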

In [91]:
model = Sequential()


# sigmoid is used here; relu was also tried
# 6 inputs: mean and standard deviation of each RGB channel
model.add(Dense(64, activation = 'sigmoid', input_dim = 6))
model.add(Dropout(.5))
model.add(Dense(64, activation = 'sigmoid'))
model.add(Dropout(.5))
model.add(Dense(18, activation = 'softmax'))

In [131]:
learn = .01
decay_rate = 1e-6
mom = .9
sgd = SGD(lr = learn, decay = decay_rate, momentum = mom , nesterov = True)
model.compile(loss = 'categorical_crossentropy', optimizer = sgd, metrics = ['accuracy'])

In [132]:
model.fit(x_train,y_train, epochs=20, batch_size = 128)
score = model.evaluate(x_test, y_test, batch_size = 300)
score

Epoch 1/20
...
Epoch 20/20


[7.5514817237854004, 0.19666667282581329]
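The two numbers returned by `model.evaluate` are the test loss and the test accuracy. To put an accuracy of about 0.197 in context, it helps to compare it with a majority-class baseline. The sketch below is a minimal version of that check, assuming each row of the label arrays is a one-hot vector (we have not verified this layout here). `majority_class_accuracy` and the toy label arrays are hypothetical.

```python
import numpy as np

def majority_class_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most common training class."""
    train_classes = np.argmax(y_train, axis=1)  # assumes one-hot rows
    test_classes = np.argmax(y_test, axis=1)
    majority = np.bincount(train_classes).argmax()
    return float(np.mean(test_classes == majority))

# Hypothetical toy labels with 3 classes
toy_train = np.eye(3)[[0, 0, 1, 2]]  # class 0 is the most common
toy_test = np.eye(3)[[0, 1, 0, 0]]
baseline = majority_class_accuracy(toy_train, toy_test)
print(baseline)
```

Calling `majority_class_accuracy(y_train, y_test)` on the real labels would show how much of the reported accuracy could be reached without looking at the features at all.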

In **model_full** we use all of the metadata scraped for the movies. This includes all of the color data as well as budget, release year, popularity score, revenue, and runtime.

Currently this model behaves oddly: the loss is never computed properly (it is reported as NaN), and we are unsure what is causing this. It also always ends up with the same final train and test accuracies, no matter the hyperparameter settings.

Although model_full has a higher test accuracy, I am in favor of using the first model, because this behavior is odd and we cannot seem to tune the hyperparameters.
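Two common causes of a NaN loss are missing values in the inputs (`np.genfromtxt` fills missing float fields with NaN) and features on very different scales, such as budget and revenue next to the color statistics. We have not confirmed that either applies here. The sketch below shows one way to check for both. The helper names, the mean imputation step, and the toy arrays are assumptions for illustration only.

```python
import numpy as np

def report_nans(name, array):
    """Print and return the number of NaN entries in an array."""
    n_bad = int(np.isnan(array).sum())
    print(name, "NaN entries:", n_bad)
    return n_bad

def standardize(train, test):
    """Scale columns to zero mean and unit variance using training statistics.

    NaN entries are replaced by the training column mean (an assumption;
    dropping the affected rows is another option).
    """
    mean = np.nanmean(train, axis=0)
    std = np.nanstd(train, axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    train_filled = np.where(np.isnan(train), mean, train)
    test_filled = np.where(np.isnan(test), mean, test)
    return (train_filled - mean) / std, (test_filled - mean) / std

# Hypothetical toy data: a budget-like column and a runtime-like column
toy_train = np.array([[1.0e8, 90.0], [np.nan, 120.0], [3.0e8, 150.0]])
toy_test = np.array([[2.0e8, np.nan]])
n_missing = report_nans("toy_train", toy_train)
train_scaled, test_scaled = standardize(toy_train, toy_test)
print(train_scaled)
print(test_scaled)
```

Running `report_nans` on `x_train_full` and `x_test2`, and training on standardized copies, would show whether either issue explains the behavior.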

In [107]:
# Model using all of the metadata features
model_full = Sequential()

# sigmoid seemed to work better than relu
# 11 inputs: 6 color features plus budget, release year, popularity score, revenue, and runtime
model_full.add(Dense(50, activation = 'sigmoid', input_dim = 11))
model_full.add(Dropout(.5))
model_full.add(Dense(50, activation = 'sigmoid'))
model_full.add(Dropout(.5))
model_full.add(Dense(18, activation = 'softmax'))

learn = .01
decay_rate = 1e-6
mom = .9
sgd = SGD(lr = learn, decay = decay_rate, momentum = mom , nesterov = True)
model_full.compile(loss = 'categorical_crossentropy', optimizer = sgd, metrics = ['accuracy'])

model_full.fit(x_train_full,y_train, epochs=20, batch_size = 100)

score_full = model_full.evaluate(x_test2, y_test2, batch_size = 300)
score_full

Epoch 1/20
...
Epoch 20/20


[nan, 0.21666666865348816]