### Milestone 4: Deep learning, due Wednesday, April 26, 2017

For this milestone you will (finally) use deep learning to predict movie genres. You will train one small network from scratch on the posters only and compare it to a pre-trained network that you fine-tune. [Here](https://keras.io/getting-started/faq/#how-can-i-use-pre-trained-models-in-keras) is a description of how to use pre-trained models in Keras.

You can try different architectures, initializations, parameter settings, optimization methods, etc. Be adventurous and explore deep learning! It can be fun to combine the features learned by the deep learning model with an SVM, or to incorporate metadata into your deep learning model.

**Note:** Be mindful of the longer training times for deep models, and budget not only for the training itself but also for the parameter tuning effort. You need time to develop a feel for the different parameters and which settings work, which normalization to use, which model architecture to choose, etc.

It is great that we have GPUs via AWS to speed up the actual computation time, but you need to be mindful of your AWS credits. The GPU instances are not cheap and can accumulate costs rather quickly. Think about your model first and do some quick dry runs with a larger learning rate or large batch size on your local machine. 

The notebook to submit this week should at least include:

- Complete description of the deep network you trained from scratch, including parameter settings, performance, features learned, etc. 
- Complete description of the pre-trained network that you fine-tuned, including parameter settings, performance, features learned, etc. 
- Discussion of the results, how much improvement you gained with fine-tuning, etc. 
- Discussion of at least one additional exploratory idea you pursued

In [71]:
import urllib2
import PIL
import os
import numpy as np

# for image manipulation. Easier to do 
# here than with Keras, as per
# https://piazza.com/class/ivlbdd3nigy3um?cid=818
import PIL.Image as Image
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [10]:
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

## Step One: Extracting Movies From URL 

In [34]:
#train = pd.read_csv("../data/train_full.csv")
train = pd.read_csv("train_full.csv")
train_thinned = pd.read_csv("train.csv")

train.drop("Unnamed: 0", axis=1, inplace=True)
train_thinned.drop("Unnamed: 0", axis=1, inplace=True)

print "Train shape:", train.shape
print "train_thinned shape:", train_thinned.shape

Train shape: (7220, 32)
train_thinned shape: (540, 29)


In [32]:
train.head(1)

Unnamed: 0,10402,10749,10751,10752,10769,10770,12,14,16,18,...,lead actors,movie_id,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count
0,0,0,0,0,0,0,0,0,0,1,...,"[u'Amy Adams', u'Jeremy Renner', u'Forest Whit...",329865,Taking place after alien crafts land around th...,25.66195,/hLudzvGfpi6JlwUnsNhXwKKg4j.jpg,2016-11-10,Arrival,False,6.9,3510


In [33]:
train_thinned.head(1)

Unnamed: 0,10402,10749,10751,10752,12,14,16,18,27,28,...,lead actors,movie_id,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count
0,0,0,1,0,0,0,1,0,0,0,...,"[u'Alec Baldwin', u'Miles Bakshi', u'Jimmy Kim...",295693,A story about how a new baby's arrival impacts...,305.881041,/unPB1iyEeTBcKiLg8W083rlViFH.jpg,2017-03-23,The Boss Baby,False,5.7,510


## Important. 

The cell below aliases the DataFrame that we want to work with as `curr_df`. When we later decide to use the full training set instead of just `train_thinned`, all we need to do is change the assignment in the cell below and re-run the code. This saves us from having to find and replace every reference to the previous DataFrame.

In [76]:
curr_df = train_thinned

In [14]:
## Helper that downloads web images.
## Takes in the poster path and the id of the movie.
## Saves the poster as a jpg named after the unique id of the movie
## in the images folder.
def download_web_image(poster_path, movie_id):
    # given that we're going to resize our images to be 32x32
    # or something else really small, let's download really small images
    # to start
    # (no trailing slash here: poster_path already starts with "/")
    base_url = "https://image.tmdb.org/t/p/w92"

    request = urllib2.Request(base_url + poster_path)
    img = urllib2.urlopen(request).read()
    image_name = "images/" + str(movie_id) + ".jpg"

    # write in binary mode so the JPEG bytes are saved unaltered
    with open(image_name, 'wb') as f:
        f.write(img)
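
The URL requested by `download_web_image` is built from three parts: the image host, a size code, and the poster path. A minimal sketch of that composition follows. `build_poster_url` is a hypothetical helper name, not part of the notebook, and `w92` is the only size code taken from the notebook. It normalizes slashes so the pieces can be joined without producing `//` in the middle of the URL.

```python
def build_poster_url(poster_path, size="w92"):
    """Compose a poster URL from a TMDB-style poster path.

    Hypothetical helper for illustration; `size` defaults to the
    "w92" code used in the notebook.
    """
    base_url = "https://image.tmdb.org/t/p"
    # poster paths in the dataframe start with "/", so normalize the
    # slashes instead of relying on plain string concatenation
    return "/".join([base_url, size, poster_path.lstrip("/")])


# poster path taken from the first row of `train` shown above
url = build_poster_url("/hLudzvGfpi6JlwUnsNhXwKKg4j.jpg")
```

The resulting string could be passed to `urllib2.Request` in place of the concatenated URL.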

In [39]:
### iterate through all of the images in the thinned dataset, saving locally 
if 1:
    print "If you actually want to download posters, you'll need to turn the `1` above into a `0`. This code doesn't run by default in the notebook so that you don't accidentally download hundreds of images."
else:
    for index, row in curr_df.iterrows():
        movie_id = row["movie_id"]
        poster_path = row["poster_path"] 
        download_web_image(poster_path, movie_id)

If you actually want to download posters, you'll need to turn the `1` above into a `0`. This code doesn't run by default in the notebook so that you don't accidentally download hundreds of images.


In [74]:
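# convert each normal poster to a 32x32 grayscale poster
# (create the output folder first so that save() does not fail)
if not os.path.exists("nn_ready_images/"):
    os.makedirs("nn_ready_images/")

for img_name in os.listdir("images/"):
    # read in an image and convert to grayscale
    im = Image.open("images/" + img_name).convert("L")
    out = im.resize((32, 32))
    out.save("nn_ready_images/" + img_name)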

# Building a CNN from Scratch

In [114]:
# smaller batch size means noisier gradient, but more updates per epoch
batch_size = 512

# number of iterations over the complete training data
epochs = 100
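
To make the trade-off in the comment above concrete, the number of gradient updates per epoch is the number of training samples divided by the batch size, rounded up. The sketch below assumes the 70% training split of the 540-row `train_thinned` set used later in this notebook (378 samples). `updates_per_epoch` is a hypothetical helper, not a Keras function.

```python
def updates_per_epoch(n_samples, batch_size):
    """Number of mini-batches (gradient updates) in one pass over the data."""
    # integer ceiling division: a final partial batch still counts as an update
    return (n_samples + batch_size - 1) // batch_size


n_train = 378  # assumed: 70% of the 540 rows in train_thinned

full_batch = updates_per_epoch(n_train, 512)   # whole set fits in one batch
small_batch = updates_per_epoch(n_train, 32)   # several updates per epoch
```

Under this assumption, `batch_size = 512` gives a single full-batch update per epoch, so 100 epochs amount to 100 gradient steps, while a batch size of 32 would give 12 updates per epoch.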

In [113]:
# now we need training and testing data. in the current state,
# we have a bunch of grayscale images named by their movie ids.
# to get the data, we can first just split all the movie ids (X) in the
# dataframe into train and test sets, and then grab their multilabel
# matrices (y)

# shuffle a copy of the ids to get a random sample; shuffling
# curr_df.movie_id.values in place could scramble the dataframe's own column
m_ids = np.random.permutation(curr_df.movie_id.values)

# 70/30 train/test split
train_size = int(.7 * len(m_ids))

# get the movie_ids (each of which has an image in "nn_ready_images/")
# which is ready to be put through the neural net
train_ids = m_ids[:train_size]
test_ids = m_ids[train_size:]
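
A reproducible variant of this split can be sketched with a seeded random generator. The seed value and the `split_ids` helper name are illustrative assumptions, not part of the notebook. `RandomState.permutation` returns a shuffled copy, so the input array is left untouched.

```python
import numpy as np


def split_ids(ids, train_fraction=0.7, seed=0):
    """Shuffle a copy of `ids` and split it into train and test parts."""
    rng = np.random.RandomState(seed)   # fixed seed -> same split every run
    shuffled = rng.permutation(ids)     # returns a copy; `ids` is not modified
    train_size = int(train_fraction * len(shuffled))
    return shuffled[:train_size], shuffled[train_size:]


# stand-in ids for illustration; in the notebook these would be movie ids
example_ids = np.arange(10)
train_part, test_part = split_ids(example_ids)
```

Fixing the seed keeps the train and test sets the same across notebook restarts, which makes runs with different hyperparameters easier to compare.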

In [121]:
# these are the column names of the multilabel matrix
label_names = curr_df.columns[:17]

# look up each movie's label row by id and stack the rows into 2-D arrays
# of shape (number of movies, number of labels)
labels_by_id = curr_df.set_index("movie_id")[label_names]
y_train = labels_by_id.loc[train_ids].values
y_test  = labels_by_id.loc[test_ids].values

In [None]:
# now we need x_train and x_test. Following the example in "labs/Keras_CNN.ipynb", 
# this needs to be an array of images with shape...
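
A minimal sketch of how such an array might be assembled is given below. The target shape `(n_images, 32, 32, 1)` (channels-last, one grayscale channel) and the scaling to `[0, 1]` are assumptions, since the exact shape expected by the lab example is not stated here. `stack_grayscale` is a hypothetical helper. Synthetic arrays stand in for the posters, which in practice could be read with `np.asarray(Image.open(path))`.

```python
import numpy as np


def stack_grayscale(images):
    """Stack 2-D uint8 images into a float32 array of shape (n, h, w, 1)."""
    batch = np.stack(images).astype("float32")  # (n, h, w)
    batch /= 255.0                               # scale pixel values to [0, 1]
    return batch[..., np.newaxis]                # add a trailing channel axis


# two synthetic 32x32 "posters": one all black, one all white
fake_images = [
    np.zeros((32, 32), dtype=np.uint8),
    np.full((32, 32), 255, dtype=np.uint8),
]
x_example = stack_grayscale(fake_images)
```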