### Milestone 4: Deep learning, due Wednesday, April 26, 2017

For this milestone you will (finally) use deep learning to predict movie genres. You will train one small network from scratch on the posters only and compare it with a pre-trained network that you fine-tune. The [Keras FAQ](https://keras.io/getting-started/faq/#how-can-i-use-pre-trained-models-in-keras) describes how to use pre-trained models in Keras.

You can try different architectures, initializations, parameter settings, optimization methods, etc. Be adventurous and explore deep learning! It can be fun to combine the features learned by the deep learning model with an SVM, or to incorporate metadata into your deep learning model.
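
As one possible way to incorporate metadata, the sketch below concatenates a feature vector taken from a network with a few standardized metadata columns before passing the result to a downstream classifier. The function name `combine_features` and the toy arrays are hypothetical; which layer the features come from and which metadata columns are used is left open.

```python
import numpy as np


def combine_features(cnn_features, metadata):
    """Concatenate CNN features with standardized metadata columns.

    cnn_features: array of shape (n_movies, n_cnn_features)
    metadata:     array of shape (n_movies, n_meta_features)
    """
    cnn_features = np.asarray(cnn_features, dtype=np.float64)
    metadata = np.asarray(metadata, dtype=np.float64)

    # Standardize each metadata column so that large-valued columns
    # (for example a vote count) do not dominate small-valued ones.
    mean = metadata.mean(axis=0)
    std = metadata.std(axis=0)
    std[std == 0] = 1.0  # constant columns become zeros instead of dividing by 0
    metadata_scaled = (metadata - mean) / std

    return np.hstack([cnn_features, metadata_scaled])


# Hypothetical toy data: 3 movies, 4 CNN features, 2 metadata columns.
cnn_features = np.random.rand(3, 4)
metadata = np.array([[25.7, 3510.0], [13.1, 4669.0], [12.1, 11276.0]])
combined = combine_features(cnn_features, metadata)
print(combined.shape)  # (3, 6)
```

In a real experiment the mean and standard deviation would be computed on the training split only and reused for validation and test data.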

**Note:** Be mindful of the longer training times for deep models. This applies not only to the training itself but also to the parameter tuning. You need time to develop a feel for the different parameters and which settings work, which normalization you want to use, which model architecture you choose, etc.

It is great that we have GPUs via AWS to speed up the actual computation, but you need to be mindful of your AWS credits. The GPU instances are not cheap, and costs can accumulate rather quickly. Think about your model first and do some quick dry runs with a larger learning rate or a larger batch size on your local machine.

The notebook to submit this week should at least include:

- Complete description of the deep network you trained from scratch, including parameter settings, performance, features learned, etc. 
- Complete description of the pre-trained network that you fine-tuned, including parameter settings, performance, features learned, etc. 
- Discussion of the results, how much improvement you gained with fine-tuning, etc. 
- Discussion of at least one additional exploratory idea you pursued
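
To make "performance" and "improvement" comparable between the two networks, one option is to report per-genre precision, recall, and F1 together with the Hamming loss. The sketch below assumes the network outputs one sigmoid probability per genre and that a fixed threshold of 0.5 is acceptable; the function name `multilabel_report` and the toy arrays are hypothetical.

```python
import numpy as np


def multilabel_report(y_true, y_prob, threshold=0.5):
    """Per-genre precision/recall/F1 plus macro F1 and Hamming loss."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_prob) >= threshold

    tp = (y_true & y_pred).sum(axis=0).astype(float)
    fp = (~y_true & y_pred).sum(axis=0).astype(float)
    fn = (y_true & ~y_pred).sum(axis=0).astype(float)

    # Guard against 0/0 for genres that are never predicted or never present.
    precision = np.divide(tp, tp + fp, out=np.zeros_like(tp), where=(tp + fp) > 0)
    recall = np.divide(tp, tp + fn, out=np.zeros_like(tp), where=(tp + fn) > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(tp), where=denom > 0)

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "macro_f1": f1.mean(),
        # Fraction of (movie, genre) cells that are predicted incorrectly.
        "hamming_loss": np.mean(y_true != y_pred),
    }


# Hypothetical toy example: 3 movies, 2 genres.
y_true = [[1, 0], [0, 1], [1, 1]]
y_prob = [[0.9, 0.2], [0.4, 0.8], [0.3, 0.7]]
report = multilabel_report(y_true, y_prob)
print(report["f1"], report["macro_f1"], report["hamming_loss"])
```

Running the same function on the predictions of the from-scratch network and of the fine-tuned network gives numbers that can be compared genre by genre.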

In [1]:
import time
import seaborn as sns
import numpy as np
import pandas as pd
from IPython.display import Image
import matplotlib.pyplot as plt
from collections import Counter
%matplotlib inline

## Installed by running this in a terminal: pip install IMDbPY
## Tutorial: http://imdbpy.sourceforge.net/support.html
import imdb

## Installed by running this in a terminal: pip install tmdbsimple
## Tutorial: https://pypi.python.org/pypi/tmdbsimple
import tmdbsimple as tmdb
## urllib2 exists only in Python 2
import urllib2



In [2]:
## Pass in our tmdb Key 
tmdb.API_KEY = '352e668a0df90032e0f1097459228131'

# Create the object that will be used to access the IMDb's database.
ia = imdb.IMDb()

## Step One: Extracting Movies From URL 

In [3]:
#train = pd.read_csv("../data/train_full.csv")
train = pd.read_csv("train_full.csv")
train_thinned = pd.read_csv("train.csv")

In [4]:
train.head()


| Unnamed: 0.1 | Unnamed: 0 | 10402 | 10749 | 10751 | 10752 | 10769 | 10770 | 12 | 14 | 16 | ... | lead actors | movie_id | overview | popularity | poster_path | release_date | title | video | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | [u'Amy Adams', u'Jeremy Renner', u'Forest Whit... | 329865 | Taking place after alien crafts land around th... | 25.66195 | /hLudzvGfpi6JlwUnsNhXwKKg4j.jpg | 2016-11-10 | Arrival | False | 6.9 | 3510 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | [u'Ben Affleck', u'Rosamund Pike', u'Carrie Co... | 210577 | With his wife's disappearance having become th... | 13.126754 | /gdiLTof3rbPDAmPaCf4g6op46bj.jpg | 2014-10-01 | Gone Girl | False | 7.9 | 4669 |
| 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | [u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'... | 27205 | Cobb, a skilled thief who commits corporate es... | 12.11045 | /qmDpIHrmpJINaRKAfWQfftjCdyi.jpg | 2010-07-14 | Inception | False | 8.0 | 11276 |
| 3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | [u"Dylan O'Brien", u'Ki Hong Lee', u'Kaya Scod... | 198663 | Set in a post-apocalyptic world, young Thomas ... | 11.037776 | /coss7RgL0NH6g4fC2s5atvf3dFO.jpg | 2014-09-10 | The Maze Runner | False | 7.0 | 4125 |
| 4 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | [u'Samuel L. Jackson', u'Kurt Russell', u'Jenn... | 273248 | Bounty hunters seek shelter from a raging bliz... | 9.701425 | /fqe8JxDNO8B8QfOGTdjh6sPCdSC.jpg | 2015-12-25 | The Hateful Eight | False | 7.5 | 3162 |
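
The columns with purely numeric names (`10402`, `12`, `16`, ...) look like one indicator column per genre id. Under that assumption, a label matrix for a multi-label network could be pulled out of the frame as sketched below. The helper name `split_genre_labels` and the toy frame are hypothetical, and the "digit-named columns are genre indicators" rule is an assumption about this dataset rather than something the notebook states.

```python
import pandas as pd


def split_genre_labels(df):
    """Return (genre column names, label matrix) for digit-named columns."""
    # Assumption: every column whose name consists only of digits is a
    # 0/1 genre indicator; all other columns are metadata.
    genre_cols = [c for c in df.columns if str(c).isdigit()]
    labels = df[genre_cols].values.astype("float32")
    return genre_cols, labels


# Hypothetical toy frame with two genre columns and two metadata columns.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "12": [0, 1],
    "16": [1, 0],
    "movie_id": [111, 222],
})
genre_cols, y = split_genre_labels(toy)
print(genre_cols)  # ['12', '16']
print(y.shape)     # (2, 2)
```

On the real data this would be called as `split_genre_labels(train)`, giving one row per movie and one column per genre.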


In [5]:
## This is taken from train.head() 
movie_id = 329865
movie = tmdb.Movies(movie_id)

In [6]:
base_url = "https://image.tmdb.org/t/p/w500/" 
file_size = "w500" 
poster_path = "kqjL17yufvn9OVLyXYpvtyrFfak.jpg"

In [7]:
base_url = "https://image.tmdb.org/t/p/w500/" 
file_size = "w500"  ## already part of base_url; kept for reference
poster_path = "kqjL17yufvn9OVLyXYpvtyrFfak.jpg"


## Helper that downloads a poster from the web.
## Takes the poster path and the id of the movie and saves the poster
## as images/<movie_id>.jpg (the images folder must already exist).
def download_web_image(poster_path, movie_id):
    ## poster_path values in train start with "/", so strip it to avoid "//"
    url = base_url + poster_path.lstrip("/")
    request = urllib2.Request(url)
    img = urllib2.urlopen(request).read()
    ## name the file after the movie id
    image_name = "images/" + str(movie_id) + ".jpg"
    ## write in binary mode so the JPEG bytes are not altered
    with open(image_name, 'wb') as f:
        f.write(img)
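
The helper above relies on `urllib2`, which is only available in Python 2. A minimal Python 3 sketch of the same idea might look as follows. The names `IMAGE_BASE_URL`, `build_poster_url`, and `download_poster` are hypothetical, and the base URL is simply the notebook's `base_url` with the size segment split off.

```python
import os
from urllib.request import urlopen

# Assumption: same image host as base_url above, without the size segment.
IMAGE_BASE_URL = "https://image.tmdb.org/t/p/"


def build_poster_url(poster_path, file_size="w500"):
    """Join base URL, size, and poster path without doubling the slash."""
    return IMAGE_BASE_URL + file_size + "/" + poster_path.lstrip("/")


def download_poster(poster_path, movie_id, image_dir="images"):
    """Download one poster and save it as <image_dir>/<movie_id>.jpg."""
    if not os.path.isdir(image_dir):
        os.makedirs(image_dir)
    image_name = os.path.join(image_dir, str(movie_id) + ".jpg")
    data = urlopen(build_poster_url(poster_path)).read()
    # Binary mode keeps the JPEG bytes intact.
    with open(image_name, "wb") as f:
        f.write(data)
    return image_name


# Building the URL needs no network access:
print(build_poster_url("/hLudzvGfpi6JlwUnsNhXwKKg4j.jpg"))
```

Keeping URL construction separate from the download makes it possible to check the URLs without hitting the network.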

In [8]:
## Example download of an image
download_web_image(poster_path, 329865)

In [9]:
train_thinned = train[0:100]

In [10]:
train_thinned.shape

(100, 33)

In [11]:
### Iterate through all of the posters in the thinned dataset, saving them locally.

### I used train_thinned because I didn't want to download 7000 posters.
### I'm not entirely sure this is how we want to represent our posters for deep learning,
### but it should be a decent stopgap solution for now.
for index, row in train_thinned.iterrows():
    movie_id = row["movie_id"]
    poster_path = row["poster_path"] 
    download_web_image(poster_path, movie_id)
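
When the loop is later extended beyond the first 100 rows, it may help to skip rows without a poster path and posters that are already on disk, so that an interrupted run can be resumed. The generator below is a sketch of that idea. The name `posters_to_download` is hypothetical, and it assumes that missing poster paths appear as NaN (a float) after `pd.read_csv`.

```python
import os


def posters_to_download(rows, image_dir="images"):
    """Yield (poster_path, movie_id) pairs that still need downloading.

    rows: iterable of (movie_id, poster_path) pairs.
    """
    for movie_id, poster_path in rows:
        # Assumption: a missing poster path is NaN (a float) or an empty string.
        if not isinstance(poster_path, str) or not poster_path:
            continue
        target = os.path.join(image_dir, str(movie_id) + ".jpg")
        if os.path.exists(target):
            continue  # already downloaded in an earlier run
        yield poster_path, movie_id


# Hypothetical toy rows: one valid path, one NaN, one empty string.
rows = [(1, "/a.jpg"), (2, float("nan")), (3, "")]
pending = list(posters_to_download(rows, image_dir="no_such_folder"))
print(pending)  # [('/a.jpg', 1)]

# With the notebook's objects this could be used as:
#   rows = zip(train_thinned["movie_id"], train_thinned["poster_path"])
#   for poster_path, movie_id in posters_to_download(rows):
#       download_web_image(poster_path, movie_id)
```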

In [None]:
## We now have 